{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a class=\"anchor\" id=\"0\"></a>\n",
    "# **Naive Bayes Classifier in Python**\n",
    "\n",
    "\n",
    "Hello friends,\n",
    "\n",
    "In machine learning, Naïve Bayes classification is a straightforward and powerful algorithm for the classification task. In this kernel, I implement Naive Bayes Classification algorithm with Python and Scikit-Learn. I build a Naive Bayes Classifier to predict whether a person makes over 50K a year. \n",
    "\n",
    "So, let's get started."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**As always, I hope you find this kernel useful and your <font color=\"red\"><b>UPVOTES</b></font> would be highly appreciated**.\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a class=\"anchor\" id=\"0.1\"></a>\n",
    "# **Table of Contents**\n",
    "\n",
    "1.\t[Introduction to Naive Bayes algorithm](#1)\n",
    "2.\t[Naive Bayes algorithm intuition](#2)\n",
    "3.\t[Types of Naive Bayes algorithm](#3)\n",
    "4.\t[Applications of Naive Bayes algorithm](#4)\n",
    "5.\t[Import libraries](#5)\n",
    "6.\t[Import dataset](#6)\n",
    "7.\t[Exploratory data analysis](#7)\n",
    "8.\t[Declare feature vector and target variable](#8)\n",
    "9.\t[Split data into separate training and test set](#9)\n",
    "10.\t[Feature engineering](#10)\n",
    "11.\t[Feature scaling](#11)\n",
    "12.\t[Model training](#12)\n",
    "13.\t[Predict the results](#13)\n",
    "14.\t[Check accuracy score](#14)\n",
    "15.\t[Confusion matrix](#15)\n",
    "16.\t[Classification metrices](#16)\n",
    "17.\t[Calculate class probabilities](#17)\n",
    "18.\t[ROC - AUC](#18)\n",
    "19.\t[k-Fold Cross Validation](#19)\n",
    "20.\t[Results and conclusion](#20)\n",
    "21. [References](#21)\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **1. Introduction to Naive Bayes algorithm** <a class=\"anchor\" id=\"1\"></a>\n",
    "\n",
    "[Table of Contents](#0.1)\n",
    "\n",
    "\n",
    "In machine learning, Naïve Bayes classification is a straightforward and powerful algorithm for the classification task. Naïve Bayes classification is based on applying Bayes’ theorem with strong independence assumption between the features.  Naïve Bayes classification produces good results when we use it for textual data analysis such as Natural Language Processing.\n",
    "\n",
    "\n",
    "Naïve Bayes models are also known as `simple Bayes` or `independent Bayes`. All these names refer to the application of Bayes’ theorem in the classifier’s decision rule. Naïve Bayes classifier applies the Bayes’ theorem in practice. This classifier brings the power of Bayes’ theorem to machine learning.\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **2. Naive Bayes algorithm intuition** <a class=\"anchor\" id=\"2\"></a>\n",
    "\n",
    "[Table of Contents](#0.1)\n",
    "\n",
    "\n",
    "Naïve Bayes Classifier uses the Bayes’ theorem to predict membership probabilities for each class such as the probability that given record or data point belongs to a particular class. The class with the highest probability is considered as the most likely class. This is also known as the **Maximum A Posteriori (MAP)**. \n",
    "\n",
    "The **MAP for a hypothesis with 2 events A and B is**\n",
    "\n",
    "**MAP (A)**\n",
    "\n",
    "= max (P (A | B))\n",
    "\n",
    "= max (P (B | A) * P (A))/P (B)\n",
    "\n",
    "= max (P (B | A) * P (A))\n",
    "\n",
    "\n",
    "Here, P (B) is evidence probability. It is used to normalize the result. It remains the same, So, removing it would not affect the result.\n",
    "\n",
    "\n",
    "Naïve Bayes Classifier assumes that all the features are unrelated to each other. Presence or absence of a feature does not influence the presence or absence of any other feature. \n",
    "\n",
    "\n",
    "In real world datasets, we test a hypothesis given multiple evidence on features. So, the calculations become quite complicated. To simplify the work, the feature independence approach is used to uncouple multiple evidence and treat each as an independent one.\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **3. Types of Naive Bayes algorithm** <a class=\"anchor\" id=\"3\"></a>\n",
    "\n",
    "[Table of Contents](#0.1)\n",
    "\n",
    "\n",
    "There are 3 types of Naïve Bayes algorithm. The 3 types are listed below:-\n",
    "\n",
    "  1. Gaussian Naïve Bayes\n",
    "\n",
    "  2. Multinomial Naïve Bayes\n",
    "\n",
    "  3. Bernoulli Naïve Bayes\n",
    "\n",
    "These 3 types of algorithm are explained below.\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **Gaussian Naïve Bayes algorithm**\n",
    "\n",
    "\n",
    "When we have continuous attribute values, we made an assumption that the values associated with each class are distributed according to Gaussian or Normal distribution. For example, suppose the training data contains a continuous attribute x. We first segment the data by the class, and then compute the mean and variance of x in each class. Let µi be the mean of the values and let σi be the variance of the values associated with the ith class. Suppose we have some observation value xi . Then, the probability distribution of xi given a class can be computed by the following equation –\n",
    "\n",
    "\n",
    "![Gaussian Naive Bayes algorithm](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQEWCcq1XtC1Yw20KWSHn2axYa7eY-a0T1TGtdVn5PvOpv9wW3FeA&s)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **Multinomial Naïve Bayes algorithm**\n",
    "\n",
    "With a Multinomial Naïve Bayes model, samples (feature vectors) represent the frequencies with which certain events have been generated by a multinomial (p1, . . . ,pn) where pi is the probability that event i occurs. Multinomial Naïve Bayes algorithm is preferred to use on data that is multinomially distributed. It is one of the standard algorithms which is used in text categorization classification."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **Bernoulli Naïve Bayes algorithm**\n",
    "\n",
    "In the multivariate Bernoulli event model, features are independent boolean variables (binary variables) describing inputs. Just like the multinomial model, this model is also popular for document classification tasks where binary term occurrence features are used rather than term frequencies."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **4. Applications of Naive Bayes algorithm** <a class=\"anchor\" id=\"4\"></a>\n",
    "\n",
    "[Table of Contents](#0.1)\n",
    "\n",
    "\n",
    "\n",
    "Naïve Bayes is one of the most straightforward and fast classification algorithm. It is very well suited for large volume of data. It is successfully used in various applications such as :\n",
    "\n",
    "1. Spam filtering\n",
    "2. Text classification\n",
    "3. Sentiment analysis\n",
    "4. Recommender systems\n",
    "\n",
    "It uses Bayes theorem of probability for prediction of unknown class.\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **5. Import libraries** <a class=\"anchor\" id=\"5\"></a>\n",
    "\n",
    "[Table of Contents](#0.1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "trusted": true
   },
   "outputs": [],
   "source": [
    "# This Python 3 environment comes with many helpful analytics libraries installed\n",
    "# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python\n",
    "# For example, here's several helpful packages to load in \n",
    "import numpy as np # linear algebra\n",
    "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n",
    "import matplotlib.pyplot as plt # for data visualization purposes\n",
    "import seaborn as sns # for statistical data visualization\n",
    "%matplotlib inline\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "trusted": true
   },
   "outputs": [],
   "source": [
    "import warnings\n",
    "\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **6. Import dataset** <a class=\"anchor\" id=\"6\"></a>\n",
    "\n",
    "[Table of Contents](#0.1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "trusted": true
   },
   "outputs": [],
   "source": [
    "data = './adult.csv'\n",
    "\n",
    "df = pd.read_csv(data, header=None, sep=',\\s')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **7. Exploratory data analysis** <a class=\"anchor\" id=\"7\"></a>\n",
    "\n",
    "[Table of Contents](#0.1)\n",
    "\n",
    "\n",
    "Now, I will explore the data to gain insights about the data. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(32561, 15)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# view dimensions of dataset\n",
    "\n",
    "df.shape"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that there are 32561 instances and 15 attributes in the data set."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### View top 5 rows of dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "      <th>7</th>\n",
       "      <th>8</th>\n",
       "      <th>9</th>\n",
       "      <th>10</th>\n",
       "      <th>11</th>\n",
       "      <th>12</th>\n",
       "      <th>13</th>\n",
       "      <th>14</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>39</td>\n",
       "      <td>State-gov</td>\n",
       "      <td>77516</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Adm-clerical</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>2174</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>50</td>\n",
       "      <td>Self-emp-not-inc</td>\n",
       "      <td>83311</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Exec-managerial</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>13</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>38</td>\n",
       "      <td>Private</td>\n",
       "      <td>215646</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Divorced</td>\n",
       "      <td>Handlers-cleaners</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>53</td>\n",
       "      <td>Private</td>\n",
       "      <td>234721</td>\n",
       "      <td>11th</td>\n",
       "      <td>7</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Handlers-cleaners</td>\n",
       "      <td>Husband</td>\n",
       "      <td>Black</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>28</td>\n",
       "      <td>Private</td>\n",
       "      <td>338409</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Prof-specialty</td>\n",
       "      <td>Wife</td>\n",
       "      <td>Black</td>\n",
       "      <td>Female</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "      <td>Cuba</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   0                 1       2          3   4                   5   \\\n",
       "0  39         State-gov   77516  Bachelors  13       Never-married   \n",
       "1  50  Self-emp-not-inc   83311  Bachelors  13  Married-civ-spouse   \n",
       "2  38           Private  215646    HS-grad   9            Divorced   \n",
       "3  53           Private  234721       11th   7  Married-civ-spouse   \n",
       "4  28           Private  338409  Bachelors  13  Married-civ-spouse   \n",
       "\n",
       "                  6              7      8       9     10  11  12  \\\n",
       "0       Adm-clerical  Not-in-family  White    Male  2174   0  40   \n",
       "1    Exec-managerial        Husband  White    Male     0   0  13   \n",
       "2  Handlers-cleaners  Not-in-family  White    Male     0   0  40   \n",
       "3  Handlers-cleaners        Husband  Black    Male     0   0  40   \n",
       "4     Prof-specialty           Wife  Black  Female     0   0  40   \n",
       "\n",
       "              13     14  \n",
       "0  United-States  <=50K  \n",
       "1  United-States  <=50K  \n",
       "2  United-States  <=50K  \n",
       "3  United-States  <=50K  \n",
       "4           Cuba  <=50K  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# preview the dataset\n",
    "\n",
    "df.head()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Rename column names\n",
    "\n",
    "We can see that the dataset does not have proper column names. The columns are merely labelled as 0,1,2.... and so on. We should give proper names to the columns. I will do it as follows:-"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "age，年龄  \n",
    "workclass，工作类别（如州立政府工作人员，私自）  \n",
    "fnlwgt，  \n",
    "education，教育  \n",
    "education_num  \n",
    "marital_status，婚姻家庭情况  \n",
    "occupation，  \n",
    "relationship，家庭关系  \n",
    "race，种族    \n",
    "sex，行别（gender）  \n",
    "capital_gain  \n",
    "capital_loss  \n",
    "hours_per_week，工作时间  \n",
    "native_country，国籍  \n",
    "income，收入  \n",
    "\n",
    "categorical proprty： ['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country', 'income']\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['age', 'workclass', 'fnlwgt', 'education', 'education_num',\n",
       "       'marital_status', 'occupation', 'relationship', 'race', 'sex',\n",
       "       'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',\n",
       "       'income'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 给dataframe表格的表头重新命名\n",
    "col_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship',\n",
    "             'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']\n",
    "\n",
    "df.columns = col_names\n",
    "\n",
    "df.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>workclass</th>\n",
       "      <th>fnlwgt</th>\n",
       "      <th>education</th>\n",
       "      <th>education_num</th>\n",
       "      <th>marital_status</th>\n",
       "      <th>occupation</th>\n",
       "      <th>relationship</th>\n",
       "      <th>race</th>\n",
       "      <th>sex</th>\n",
       "      <th>capital_gain</th>\n",
       "      <th>capital_loss</th>\n",
       "      <th>hours_per_week</th>\n",
       "      <th>native_country</th>\n",
       "      <th>income</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>39</td>\n",
       "      <td>State-gov</td>\n",
       "      <td>77516</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Adm-clerical</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>2174</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>50</td>\n",
       "      <td>Self-emp-not-inc</td>\n",
       "      <td>83311</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Exec-managerial</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>13</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>38</td>\n",
       "      <td>Private</td>\n",
       "      <td>215646</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Divorced</td>\n",
       "      <td>Handlers-cleaners</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>53</td>\n",
       "      <td>Private</td>\n",
       "      <td>234721</td>\n",
       "      <td>11th</td>\n",
       "      <td>7</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Handlers-cleaners</td>\n",
       "      <td>Husband</td>\n",
       "      <td>Black</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>28</td>\n",
       "      <td>Private</td>\n",
       "      <td>338409</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Prof-specialty</td>\n",
       "      <td>Wife</td>\n",
       "      <td>Black</td>\n",
       "      <td>Female</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "      <td>Cuba</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   age         workclass  fnlwgt  education  education_num  \\\n",
       "0   39         State-gov   77516  Bachelors             13   \n",
       "1   50  Self-emp-not-inc   83311  Bachelors             13   \n",
       "2   38           Private  215646    HS-grad              9   \n",
       "3   53           Private  234721       11th              7   \n",
       "4   28           Private  338409  Bachelors             13   \n",
       "\n",
       "       marital_status         occupation   relationship   race     sex  \\\n",
       "0       Never-married       Adm-clerical  Not-in-family  White    Male   \n",
       "1  Married-civ-spouse    Exec-managerial        Husband  White    Male   \n",
       "2            Divorced  Handlers-cleaners  Not-in-family  White    Male   \n",
       "3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male   \n",
       "4  Married-civ-spouse     Prof-specialty           Wife  Black  Female   \n",
       "\n",
       "   capital_gain  capital_loss  hours_per_week native_country income  \n",
       "0          2174             0              40  United-States  <=50K  \n",
       "1             0             0              13  United-States  <=50K  \n",
       "2             0             0              40  United-States  <=50K  \n",
       "3             0             0              40  United-States  <=50K  \n",
       "4             0             0              40           Cuba  <=50K  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# let's again preview the dataset\n",
    "# 显示表格的前5行数据\n",
    "df.head()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that the column names are renamed. Now, the columns have meaningful names."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### View summary of dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 32561 entries, 0 to 32560\n",
      "Data columns (total 15 columns):\n",
      " #   Column          Non-Null Count  Dtype \n",
      "---  ------          --------------  ----- \n",
      " 0   age             32561 non-null  int64 \n",
      " 1   workclass       32561 non-null  object\n",
      " 2   fnlwgt          32561 non-null  int64 \n",
      " 3   education       32561 non-null  object\n",
      " 4   education_num   32561 non-null  int64 \n",
      " 5   marital_status  32561 non-null  object\n",
      " 6   occupation      32561 non-null  object\n",
      " 7   relationship    32561 non-null  object\n",
      " 8   race            32561 non-null  object\n",
      " 9   sex             32561 non-null  object\n",
      " 10  capital_gain    32561 non-null  int64 \n",
      " 11  capital_loss    32561 non-null  int64 \n",
      " 12  hours_per_week  32561 non-null  int64 \n",
      " 13  native_country  32561 non-null  object\n",
      " 14  income          32561 non-null  object\n",
      "dtypes: int64(6), object(9)\n",
      "memory usage: 3.7+ MB\n"
     ]
    }
   ],
   "source": [
    "# view summary of dataset\n",
    "# dataframe的基本信息，每一列的名称，非空值多少，数据类型是什么\n",
    "df.info()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that there are no missing values in the dataset. I will confirm this further."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Types of variables\n",
    "\n",
    "\n",
    "In this section, I segregate the dataset into categorical and numerical variables. There are a mixture of categorical and numerical variables in the dataset. Categorical variables have data type object. Numerical variables have data type int64.\n",
    "\n",
    "\n",
    "First of all, I will explore categorical variables."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Explore categorical variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There are 9 categorical variables\n",
      "\n",
      "The categorical variables are :\n",
      "\n",
      " ['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country', 'income']\n"
     ]
    }
   ],
   "source": [
    "# find categorical variables\n",
    "# 哪些属性列属于类别列？\n",
    "\n",
    "categorical = [var for var in df.columns if df[var].dtype=='O']\n",
    "\n",
    "print('There are {} categorical variables\\n'.format(len(categorical)))\n",
    "\n",
    "print('The categorical variables are :\\n\\n', categorical)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>workclass</th>\n",
       "      <th>education</th>\n",
       "      <th>marital_status</th>\n",
       "      <th>occupation</th>\n",
       "      <th>relationship</th>\n",
       "      <th>race</th>\n",
       "      <th>sex</th>\n",
       "      <th>native_country</th>\n",
       "      <th>income</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>State-gov</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Adm-clerical</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Self-emp-not-inc</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Exec-managerial</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Private</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>Divorced</td>\n",
       "      <td>Handlers-cleaners</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Private</td>\n",
       "      <td>11th</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Handlers-cleaners</td>\n",
       "      <td>Husband</td>\n",
       "      <td>Black</td>\n",
       "      <td>Male</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Private</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Prof-specialty</td>\n",
       "      <td>Wife</td>\n",
       "      <td>Black</td>\n",
       "      <td>Female</td>\n",
       "      <td>Cuba</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          workclass  education      marital_status         occupation  \\\n",
       "0         State-gov  Bachelors       Never-married       Adm-clerical   \n",
       "1  Self-emp-not-inc  Bachelors  Married-civ-spouse    Exec-managerial   \n",
       "2           Private    HS-grad            Divorced  Handlers-cleaners   \n",
       "3           Private       11th  Married-civ-spouse  Handlers-cleaners   \n",
       "4           Private  Bachelors  Married-civ-spouse     Prof-specialty   \n",
       "\n",
       "    relationship   race     sex native_country income  \n",
       "0  Not-in-family  White    Male  United-States  <=50K  \n",
       "1        Husband  White    Male  United-States  <=50K  \n",
       "2  Not-in-family  White    Male  United-States  <=50K  \n",
       "3        Husband  Black    Male  United-States  <=50K  \n",
       "4           Wife  Black  Female           Cuba  <=50K  "
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# view the categorical variables\n",
    "# 类别列属性的前几行数据\n",
    "df[categorical].head()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Summary of categorical variables\n",
    "\n",
    "\n",
    "- There are 9 categorical variables. \n",
    "\n",
    "\n",
    "- The categorical variables are given by `workclass`, `education`, `marital_status`, `occupation`, `relationship`, `race`, `sex`, `native_country` and `income`.\n",
    "\n",
    "\n",
    "- `income` is the target variable."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Explore problems within categorical variables\n",
    "\n",
    "\n",
    "First, I will explore the categorical variables.\n",
    "\n",
    "\n",
    "### Missing values in categorical variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "workclass         0\n",
       "education         0\n",
       "marital_status    0\n",
       "occupation        0\n",
       "relationship      0\n",
       "race              0\n",
       "sex               0\n",
       "native_country    0\n",
       "income            0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check missing values in categorical variables\n",
    "# 类别列中是否存在着空值\n",
    "df[categorical].isnull().sum()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that there are no missing values in the categorical variables. I will confirm this further."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Frequency counts of categorical variables\n",
    "\n",
    "\n",
    "Now, I will check the frequency counts of categorical variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "workclass\n",
      "Private             22696\n",
      "Self-emp-not-inc     2541\n",
      "Local-gov            2093\n",
      "?                    1836\n",
      "State-gov            1298\n",
      "Self-emp-inc         1116\n",
      "Federal-gov           960\n",
      "Without-pay            14\n",
      "Never-worked            7\n",
      "Name: count, dtype: int64\n",
      "education\n",
      "HS-grad         10501\n",
      "Some-college     7291\n",
      "Bachelors        5355\n",
      "Masters          1723\n",
      "Assoc-voc        1382\n",
      "11th             1175\n",
      "Assoc-acdm       1067\n",
      "10th              933\n",
      "7th-8th           646\n",
      "Prof-school       576\n",
      "9th               514\n",
      "12th              433\n",
      "Doctorate         413\n",
      "5th-6th           333\n",
      "1st-4th           168\n",
      "Preschool          51\n",
      "Name: count, dtype: int64\n",
      "marital_status\n",
      "Married-civ-spouse       14976\n",
      "Never-married            10683\n",
      "Divorced                  4443\n",
      "Separated                 1025\n",
      "Widowed                    993\n",
      "Married-spouse-absent      418\n",
      "Married-AF-spouse           23\n",
      "Name: count, dtype: int64\n",
      "occupation\n",
      "Prof-specialty       4140\n",
      "Craft-repair         4099\n",
      "Exec-managerial      4066\n",
      "Adm-clerical         3770\n",
      "Sales                3650\n",
      "Other-service        3295\n",
      "Machine-op-inspct    2002\n",
      "?                    1843\n",
      "Transport-moving     1597\n",
      "Handlers-cleaners    1370\n",
      "Farming-fishing       994\n",
      "Tech-support          928\n",
      "Protective-serv       649\n",
      "Priv-house-serv       149\n",
      "Armed-Forces            9\n",
      "Name: count, dtype: int64\n",
      "relationship\n",
      "Husband           13193\n",
      "Not-in-family      8305\n",
      "Own-child          5068\n",
      "Unmarried          3446\n",
      "Wife               1568\n",
      "Other-relative      981\n",
      "Name: count, dtype: int64\n",
      "race\n",
      "White                 27816\n",
      "Black                  3124\n",
      "Asian-Pac-Islander     1039\n",
      "Amer-Indian-Eskimo      311\n",
      "Other                   271\n",
      "Name: count, dtype: int64\n",
      "sex\n",
      "Male      21790\n",
      "Female    10771\n",
      "Name: count, dtype: int64\n",
      "native_country\n",
      "United-States                 29170\n",
      "Mexico                          643\n",
      "?                               583\n",
      "Philippines                     198\n",
      "Germany                         137\n",
      "Canada                          121\n",
      "Puerto-Rico                     114\n",
      "El-Salvador                     106\n",
      "India                           100\n",
      "Cuba                             95\n",
      "England                          90\n",
      "Jamaica                          81\n",
      "South                            80\n",
      "China                            75\n",
      "Italy                            73\n",
      "Dominican-Republic               70\n",
      "Vietnam                          67\n",
      "Guatemala                        64\n",
      "Japan                            62\n",
      "Poland                           60\n",
      "Columbia                         59\n",
      "Taiwan                           51\n",
      "Haiti                            44\n",
      "Iran                             43\n",
      "Portugal                         37\n",
      "Nicaragua                        34\n",
      "Peru                             31\n",
      "France                           29\n",
      "Greece                           29\n",
      "Ecuador                          28\n",
      "Ireland                          24\n",
      "Hong                             20\n",
      "Cambodia                         19\n",
      "Trinadad&Tobago                  19\n",
      "Laos                             18\n",
      "Thailand                         18\n",
      "Yugoslavia                       16\n",
      "Outlying-US(Guam-USVI-etc)       14\n",
      "Honduras                         13\n",
      "Hungary                          13\n",
      "Scotland                         12\n",
      "Holand-Netherlands                1\n",
      "Name: count, dtype: int64\n",
      "income\n",
      "<=50K    24720\n",
      ">50K      7841\n",
      "Name: count, dtype: int64\n"
     ]
    }
   ],
   "source": [
    "# view frequency counts of values in categorical variables\n",
    "# categorical为类别列名列表集合\n",
    "for var in categorical: \n",
    "    print(df[var].value_counts())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "workclass\n",
      "Private             22696\n",
      "Self-emp-not-inc     2541\n",
      "Local-gov            2093\n",
      "?                    1836\n",
      "State-gov            1298\n",
      "Self-emp-inc         1116\n",
      "Federal-gov           960\n",
      "Without-pay            14\n",
      "Never-worked            7\n",
      "Name: count, dtype: int64\n",
      "workclass\n",
      "Private             0.697030\n",
      "Self-emp-not-inc    0.078038\n",
      "Local-gov           0.064279\n",
      "?                   0.056386\n",
      "State-gov           0.039864\n",
      "Self-emp-inc        0.034274\n",
      "Federal-gov         0.029483\n",
      "Without-pay         0.000430\n",
      "Never-worked        0.000215\n",
      "Name: count, dtype: float64\n",
      "education\n",
      "HS-grad         10501\n",
      "Some-college     7291\n",
      "Bachelors        5355\n",
      "Masters          1723\n",
      "Assoc-voc        1382\n",
      "11th             1175\n",
      "Assoc-acdm       1067\n",
      "10th              933\n",
      "7th-8th           646\n",
      "Prof-school       576\n",
      "9th               514\n",
      "12th              433\n",
      "Doctorate         413\n",
      "5th-6th           333\n",
      "1st-4th           168\n",
      "Preschool          51\n",
      "Name: count, dtype: int64\n",
      "education\n",
      "HS-grad         0.322502\n",
      "Some-college    0.223918\n",
      "Bachelors       0.164461\n",
      "Masters         0.052916\n",
      "Assoc-voc       0.042443\n",
      "11th            0.036086\n",
      "Assoc-acdm      0.032769\n",
      "10th            0.028654\n",
      "7th-8th         0.019840\n",
      "Prof-school     0.017690\n",
      "9th             0.015786\n",
      "12th            0.013298\n",
      "Doctorate       0.012684\n",
      "5th-6th         0.010227\n",
      "1st-4th         0.005160\n",
      "Preschool       0.001566\n",
      "Name: count, dtype: float64\n",
      "marital_status\n",
      "Married-civ-spouse       14976\n",
      "Never-married            10683\n",
      "Divorced                  4443\n",
      "Separated                 1025\n",
      "Widowed                    993\n",
      "Married-spouse-absent      418\n",
      "Married-AF-spouse           23\n",
      "Name: count, dtype: int64\n",
      "marital_status\n",
      "Married-civ-spouse       0.459937\n",
      "Never-married            0.328092\n",
      "Divorced                 0.136452\n",
      "Separated                0.031479\n",
      "Widowed                  0.030497\n",
      "Married-spouse-absent    0.012837\n",
      "Married-AF-spouse        0.000706\n",
      "Name: count, dtype: float64\n",
      "occupation\n",
      "Prof-specialty       4140\n",
      "Craft-repair         4099\n",
      "Exec-managerial      4066\n",
      "Adm-clerical         3770\n",
      "Sales                3650\n",
      "Other-service        3295\n",
      "Machine-op-inspct    2002\n",
      "?                    1843\n",
      "Transport-moving     1597\n",
      "Handlers-cleaners    1370\n",
      "Farming-fishing       994\n",
      "Tech-support          928\n",
      "Protective-serv       649\n",
      "Priv-house-serv       149\n",
      "Armed-Forces            9\n",
      "Name: count, dtype: int64\n",
      "occupation\n",
      "Prof-specialty       0.127146\n",
      "Craft-repair         0.125887\n",
      "Exec-managerial      0.124873\n",
      "Adm-clerical         0.115783\n",
      "Sales                0.112097\n",
      "Other-service        0.101195\n",
      "Machine-op-inspct    0.061485\n",
      "?                    0.056601\n",
      "Transport-moving     0.049046\n",
      "Handlers-cleaners    0.042075\n",
      "Farming-fishing      0.030527\n",
      "Tech-support         0.028500\n",
      "Protective-serv      0.019932\n",
      "Priv-house-serv      0.004576\n",
      "Armed-Forces         0.000276\n",
      "Name: count, dtype: float64\n",
      "relationship\n",
      "Husband           13193\n",
      "Not-in-family      8305\n",
      "Own-child          5068\n",
      "Unmarried          3446\n",
      "Wife               1568\n",
      "Other-relative      981\n",
      "Name: count, dtype: int64\n",
      "relationship\n",
      "Husband           0.405178\n",
      "Not-in-family     0.255060\n",
      "Own-child         0.155646\n",
      "Unmarried         0.105832\n",
      "Wife              0.048156\n",
      "Other-relative    0.030128\n",
      "Name: count, dtype: float64\n",
      "race\n",
      "White                 27816\n",
      "Black                  3124\n",
      "Asian-Pac-Islander     1039\n",
      "Amer-Indian-Eskimo      311\n",
      "Other                   271\n",
      "Name: count, dtype: int64\n",
      "race\n",
      "White                 0.854274\n",
      "Black                 0.095943\n",
      "Asian-Pac-Islander    0.031909\n",
      "Amer-Indian-Eskimo    0.009551\n",
      "Other                 0.008323\n",
      "Name: count, dtype: float64\n",
      "sex\n",
      "Male      21790\n",
      "Female    10771\n",
      "Name: count, dtype: int64\n",
      "sex\n",
      "Male      0.669205\n",
      "Female    0.330795\n",
      "Name: count, dtype: float64\n",
      "native_country\n",
      "United-States                 29170\n",
      "Mexico                          643\n",
      "?                               583\n",
      "Philippines                     198\n",
      "Germany                         137\n",
      "Canada                          121\n",
      "Puerto-Rico                     114\n",
      "El-Salvador                     106\n",
      "India                           100\n",
      "Cuba                             95\n",
      "England                          90\n",
      "Jamaica                          81\n",
      "South                            80\n",
      "China                            75\n",
      "Italy                            73\n",
      "Dominican-Republic               70\n",
      "Vietnam                          67\n",
      "Guatemala                        64\n",
      "Japan                            62\n",
      "Poland                           60\n",
      "Columbia                         59\n",
      "Taiwan                           51\n",
      "Haiti                            44\n",
      "Iran                             43\n",
      "Portugal                         37\n",
      "Nicaragua                        34\n",
      "Peru                             31\n",
      "France                           29\n",
      "Greece                           29\n",
      "Ecuador                          28\n",
      "Ireland                          24\n",
      "Hong                             20\n",
      "Cambodia                         19\n",
      "Trinadad&Tobago                  19\n",
      "Laos                             18\n",
      "Thailand                         18\n",
      "Yugoslavia                       16\n",
      "Outlying-US(Guam-USVI-etc)       14\n",
      "Honduras                         13\n",
      "Hungary                          13\n",
      "Scotland                         12\n",
      "Holand-Netherlands                1\n",
      "Name: count, dtype: int64\n",
      "native_country\n",
      "United-States                 0.895857\n",
      "Mexico                        0.019748\n",
      "?                             0.017905\n",
      "Philippines                   0.006081\n",
      "Germany                       0.004207\n",
      "Canada                        0.003716\n",
      "Puerto-Rico                   0.003501\n",
      "El-Salvador                   0.003255\n",
      "India                         0.003071\n",
      "Cuba                          0.002918\n",
      "England                       0.002764\n",
      "Jamaica                       0.002488\n",
      "South                         0.002457\n",
      "China                         0.002303\n",
      "Italy                         0.002242\n",
      "Dominican-Republic            0.002150\n",
      "Vietnam                       0.002058\n",
      "Guatemala                     0.001966\n",
      "Japan                         0.001904\n",
      "Poland                        0.001843\n",
      "Columbia                      0.001812\n",
      "Taiwan                        0.001566\n",
      "Haiti                         0.001351\n",
      "Iran                          0.001321\n",
      "Portugal                      0.001136\n",
      "Nicaragua                     0.001044\n",
      "Peru                          0.000952\n",
      "France                        0.000891\n",
      "Greece                        0.000891\n",
      "Ecuador                       0.000860\n",
      "Ireland                       0.000737\n",
      "Hong                          0.000614\n",
      "Cambodia                      0.000584\n",
      "Trinadad&Tobago               0.000584\n",
      "Laos                          0.000553\n",
      "Thailand                      0.000553\n",
      "Yugoslavia                    0.000491\n",
      "Outlying-US(Guam-USVI-etc)    0.000430\n",
      "Honduras                      0.000399\n",
      "Hungary                       0.000399\n",
      "Scotland                      0.000369\n",
      "Holand-Netherlands            0.000031\n",
      "Name: count, dtype: float64\n",
      "income\n",
      "<=50K    24720\n",
      ">50K      7841\n",
      "Name: count, dtype: int64\n",
      "income\n",
      "<=50K    0.75919\n",
      ">50K     0.24081\n",
      "Name: count, dtype: float64\n"
     ]
    }
   ],
   "source": [
    "# view frequency distribution of categorical variables\n",
    "for var in categorical: \n",
    "    print(df[var].value_counts())\n",
    "    print(df[var].value_counts()/np.float32(len(df)))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we can see that there are several variables like `workclass`, `occupation` and `native_country` which contain missing values. Generally, the missing values are coded as `NaN` and python will detect them with the usual command of `df.isnull().sum()`.\n",
    "\n",
    "But, in this case the missing values are coded as `?`. Python fail to detect these as missing values because it do not consider `?` as missing values. So, I have to replace `?` with `NaN` so that Python can detect these missing values.\n",
    "\n",
    "I will explore these variables and replace `?` with `NaN`."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Explore workclass variable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['State-gov', 'Self-emp-not-inc', 'Private', 'Federal-gov',\n",
       "       'Local-gov', '?', 'Self-emp-inc', 'Without-pay', 'Never-worked'],\n",
       "      dtype=object)"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check labels in workclass variable\n",
    "# 属性列workclass中的具体的类别数有哪些\n",
    "df.workclass.unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "workclass\n",
       "Private             22696\n",
       "Self-emp-not-inc     2541\n",
       "Local-gov            2093\n",
       "?                    1836\n",
       "State-gov            1298\n",
       "Self-emp-inc         1116\n",
       "Federal-gov           960\n",
       "Without-pay            14\n",
       "Never-worked            7\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check frequency distribution of values in workclass variable\n",
    "# 计数，返回类似字典数据类型的统计结果，每一个具体类别的数目有多少\n",
    "df.workclass.value_counts()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that there are 1836 values encoded as `?` in workclass variable. I will replace these `?` with `NaN`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "trusted": true
   },
   "outputs": [],
   "source": [
    "# replace '?' values in workclass variable with `NaN`\n",
    "# 处理特殊值\n",
    "df['workclass'].replace('?', np.NaN, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "scrolled": true,
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "workclass\n",
       "Private             22696\n",
       "Self-emp-not-inc     2541\n",
       "Local-gov            2093\n",
       "State-gov            1298\n",
       "Self-emp-inc         1116\n",
       "Federal-gov           960\n",
       "Without-pay            14\n",
       "Never-worked            7\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# again check the frequency distribution of values in workclass variable\n",
    "# 相当于删除了某行的结果\n",
    "df.workclass.value_counts()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we can see that there are no values encoded as `?` in the `workclass` variable.\n",
    "\n",
    "I will adopt similar approach with `occupation` and `native_country` column."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Explore occupation variable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['Adm-clerical', 'Exec-managerial', 'Handlers-cleaners',\n",
       "       'Prof-specialty', 'Other-service', 'Sales', 'Craft-repair',\n",
       "       'Transport-moving', 'Farming-fishing', 'Machine-op-inspct',\n",
       "       'Tech-support', '?', 'Protective-serv', 'Armed-Forces',\n",
       "       'Priv-house-serv'], dtype=object)"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check labels in occupation variable\n",
    "\n",
    "df.occupation.unique()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "occupation\n",
       "Prof-specialty       4140\n",
       "Craft-repair         4099\n",
       "Exec-managerial      4066\n",
       "Adm-clerical         3770\n",
       "Sales                3650\n",
       "Other-service        3295\n",
       "Machine-op-inspct    2002\n",
       "?                    1843\n",
       "Transport-moving     1597\n",
       "Handlers-cleaners    1370\n",
       "Farming-fishing       994\n",
       "Tech-support          928\n",
       "Protective-serv       649\n",
       "Priv-house-serv       149\n",
       "Armed-Forces            9\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check frequency distribution of values in occupation variable\n",
    "\n",
    "df.occupation.value_counts()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that there are 1843 values encoded as `?` in `occupation` variable. I will replace these `?` with `NaN`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "trusted": true
   },
   "outputs": [],
   "source": [
    "# replace '?' values in occupation variable with `NaN`\n",
    "\n",
    "df['occupation'].replace('?', np.NaN, inplace=True)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "occupation\n",
       "Prof-specialty       4140\n",
       "Craft-repair         4099\n",
       "Exec-managerial      4066\n",
       "Adm-clerical         3770\n",
       "Sales                3650\n",
       "Other-service        3295\n",
       "Machine-op-inspct    2002\n",
       "Transport-moving     1597\n",
       "Handlers-cleaners    1370\n",
       "Farming-fishing       994\n",
       "Tech-support          928\n",
       "Protective-serv       649\n",
       "Priv-house-serv       149\n",
       "Armed-Forces            9\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# again check the frequency distribution of values in occupation variable\n",
    "df.occupation.value_counts()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Explore native_country variable\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['United-States', 'Cuba', 'Jamaica', 'India', '?', 'Mexico',\n",
       "       'South', 'Puerto-Rico', 'Honduras', 'England', 'Canada', 'Germany',\n",
       "       'Iran', 'Philippines', 'Italy', 'Poland', 'Columbia', 'Cambodia',\n",
       "       'Thailand', 'Ecuador', 'Laos', 'Taiwan', 'Haiti', 'Portugal',\n",
       "       'Dominican-Republic', 'El-Salvador', 'France', 'Guatemala',\n",
       "       'China', 'Japan', 'Yugoslavia', 'Peru',\n",
       "       'Outlying-US(Guam-USVI-etc)', 'Scotland', 'Trinadad&Tobago',\n",
       "       'Greece', 'Nicaragua', 'Vietnam', 'Hong', 'Ireland', 'Hungary',\n",
       "       'Holand-Netherlands'], dtype=object)"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check labels in native_country variable\n",
    "df.native_country.unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "native_country\n",
       "United-States                 29170\n",
       "Mexico                          643\n",
       "?                               583\n",
       "Philippines                     198\n",
       "Germany                         137\n",
       "Canada                          121\n",
       "Puerto-Rico                     114\n",
       "El-Salvador                     106\n",
       "India                           100\n",
       "Cuba                             95\n",
       "England                          90\n",
       "Jamaica                          81\n",
       "South                            80\n",
       "China                            75\n",
       "Italy                            73\n",
       "Dominican-Republic               70\n",
       "Vietnam                          67\n",
       "Guatemala                        64\n",
       "Japan                            62\n",
       "Poland                           60\n",
       "Columbia                         59\n",
       "Taiwan                           51\n",
       "Haiti                            44\n",
       "Iran                             43\n",
       "Portugal                         37\n",
       "Nicaragua                        34\n",
       "Peru                             31\n",
       "France                           29\n",
       "Greece                           29\n",
       "Ecuador                          28\n",
       "Ireland                          24\n",
       "Hong                             20\n",
       "Cambodia                         19\n",
       "Trinadad&Tobago                  19\n",
       "Laos                             18\n",
       "Thailand                         18\n",
       "Yugoslavia                       16\n",
       "Outlying-US(Guam-USVI-etc)       14\n",
       "Honduras                         13\n",
       "Hungary                          13\n",
       "Scotland                         12\n",
       "Holand-Netherlands                1\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check frequency distribution of values in native_country variable\n",
    "df.native_country.value_counts()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that there are 583 values encoded as `?` in `native_country` variable. I will replace these `?` with `NaN`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "trusted": true
   },
   "outputs": [],
   "source": [
    "# replace '?' values in native_country variable with `NaN`\n",
    "\n",
    "df['native_country'].replace('?', np.NaN, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "native_country\n",
       "United-States                 29170\n",
       "Mexico                          643\n",
       "Philippines                     198\n",
       "Germany                         137\n",
       "Canada                          121\n",
       "Puerto-Rico                     114\n",
       "El-Salvador                     106\n",
       "India                           100\n",
       "Cuba                             95\n",
       "England                          90\n",
       "Jamaica                          81\n",
       "South                            80\n",
       "China                            75\n",
       "Italy                            73\n",
       "Dominican-Republic               70\n",
       "Vietnam                          67\n",
       "Guatemala                        64\n",
       "Japan                            62\n",
       "Poland                           60\n",
       "Columbia                         59\n",
       "Taiwan                           51\n",
       "Haiti                            44\n",
       "Iran                             43\n",
       "Portugal                         37\n",
       "Nicaragua                        34\n",
       "Peru                             31\n",
       "France                           29\n",
       "Greece                           29\n",
       "Ecuador                          28\n",
       "Ireland                          24\n",
       "Hong                             20\n",
       "Cambodia                         19\n",
       "Trinadad&Tobago                  19\n",
       "Laos                             18\n",
       "Thailand                         18\n",
       "Yugoslavia                       16\n",
       "Outlying-US(Guam-USVI-etc)       14\n",
       "Honduras                         13\n",
       "Hungary                          13\n",
       "Scotland                         12\n",
       "Holand-Netherlands                1\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# again check the frequency distribution of values in native_country variable\n",
    "\n",
    "df.native_country.value_counts()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Check missing values in categorical variables again"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "workclass         1836\n",
       "education            0\n",
       "marital_status       0\n",
       "occupation        1843\n",
       "relationship         0\n",
       "race                 0\n",
       "sex                  0\n",
       "native_country     583\n",
       "income               0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[categorical].isnull().sum()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we can see that `workclass`, `occupation` and `native_country` variable contains missing values."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Number of labels: cardinality\n",
    "\n",
    "\n",
    "The number of labels within a categorical variable is known as **cardinality**. A high number of labels within a variable is known as **high cardinality**. High cardinality may pose some serious problems in the machine learning model. So, I will check for high cardinality."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "workclass  contains  9  labels\n",
      "education  contains  16  labels\n",
      "marital_status  contains  7  labels\n",
      "occupation  contains  15  labels\n",
      "relationship  contains  6  labels\n",
      "race  contains  5  labels\n",
      "sex  contains  2  labels\n",
      "native_country  contains  42  labels\n",
      "income  contains  2  labels\n"
     ]
    }
   ],
   "source": [
    "# check for cardinality in categorical variables\n",
    "\n",
    "for var in categorical:\n",
    "    print(var, ' contains ', len(df[var].unique()), ' labels')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that `native_country` column contains relatively large number of labels as compared to other columns. I will check for cardinality after train-test split."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Explore Numerical Variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There are 6 numerical variables\n",
      "\n",
      "The numerical variables are : ['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']\n"
     ]
    }
   ],
   "source": [
    "# find numerical variables\n",
    "\n",
    "numerical = [var for var in df.columns if df[var].dtype!='O']\n",
    "\n",
    "print('There are {} numerical variables\\n'.format(len(numerical)))\n",
    "\n",
    "print('The numerical variables are :', numerical)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>fnlwgt</th>\n",
       "      <th>education_num</th>\n",
       "      <th>capital_gain</th>\n",
       "      <th>capital_loss</th>\n",
       "      <th>hours_per_week</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>39</td>\n",
       "      <td>77516</td>\n",
       "      <td>13</td>\n",
       "      <td>2174</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>50</td>\n",
       "      <td>83311</td>\n",
       "      <td>13</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>38</td>\n",
       "      <td>215646</td>\n",
       "      <td>9</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>53</td>\n",
       "      <td>234721</td>\n",
       "      <td>7</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>28</td>\n",
       "      <td>338409</td>\n",
       "      <td>13</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   age  fnlwgt  education_num  capital_gain  capital_loss  hours_per_week\n",
       "0   39   77516             13          2174             0              40\n",
       "1   50   83311             13             0             0              13\n",
       "2   38  215646              9             0             0              40\n",
       "3   53  234721              7             0             0              40\n",
       "4   28  338409             13             0             0              40"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# view the numerical variables\n",
    "\n",
    "df[numerical].head()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Summary of numerical variables\n",
    "\n",
    "\n",
    "- There are 6 numerical variables. \n",
    "\n",
    "\n",
    "- These are given by `age`, `fnlwgt`, `education_num`, `capital_gain`, `capital_loss` and `hours_per_week`.\n",
    "\n",
    "\n",
    "- All of the numerical variables are of discrete data type."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Explore problems within numerical variables\n",
    "\n",
    "\n",
    "Now, I will explore the numerical variables.\n",
    "\n",
    "\n",
    "### Missing values in numerical variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "age               0\n",
       "fnlwgt            0\n",
       "education_num     0\n",
       "capital_gain      0\n",
       "capital_loss      0\n",
       "hours_per_week    0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check missing values in numerical variables\n",
    "\n",
    "df[numerical].isnull().sum()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that all the 6 numerical variables do not contain missing values. "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **8. Declare feature vector and target variable** <a class=\"anchor\" id=\"8\"></a>\n",
    "\n",
    "[Table of Contents](#0.1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "trusted": true
   },
   "outputs": [],
   "source": [
    "X = df.drop(['income'], axis=1)\n",
    "\n",
    "y = df['income']"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **9. Split data into separate training and test set** <a class=\"anchor\" id=\"9\"></a>\n",
    "\n",
    "[Table of Contents](#0.1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "trusted": true
   },
   "outputs": [],
   "source": [
    "# split X and y into training and testing sets\n",
    "\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((22792, 14), (9769, 14))"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check the shape of X_train and X_test\n",
    "\n",
    "X_train.shape, X_test.shape"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **10. Feature Engineering** <a class=\"anchor\" id=\"10\"></a>\n",
    "\n",
    "[Table of Contents](#0.1)\n",
    "\n",
    "\n",
    "**Feature Engineering** is the process of transforming raw data into useful features that help us to understand our model better and increase its predictive power. I will carry out feature engineering on different types of variables.\n",
    "\n",
    "\n",
    "First, I will display the categorical and numerical variables again separately."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "age                int64\n",
       "workclass         object\n",
       "fnlwgt             int64\n",
       "education         object\n",
       "education_num      int64\n",
       "marital_status    object\n",
       "occupation        object\n",
       "relationship      object\n",
       "race              object\n",
       "sex               object\n",
       "capital_gain       int64\n",
       "capital_loss       int64\n",
       "hours_per_week     int64\n",
       "native_country    object\n",
       "dtype: object"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check data types in X_train\n",
    "\n",
    "X_train.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['workclass',\n",
       " 'education',\n",
       " 'marital_status',\n",
       " 'occupation',\n",
       " 'relationship',\n",
       " 'race',\n",
       " 'sex',\n",
       " 'native_country']"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# display categorical variables\n",
    "\n",
    "categorical = [col for col in X_train.columns if X_train[col].dtypes == 'O']\n",
    "\n",
    "categorical"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['age',\n",
       " 'fnlwgt',\n",
       " 'education_num',\n",
       " 'capital_gain',\n",
       " 'capital_loss',\n",
       " 'hours_per_week']"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# display numerical variables\n",
    "\n",
    "numerical = [col for col in X_train.columns if X_train[col].dtypes != 'O']\n",
    "\n",
    "numerical"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Engineering missing values in categorical variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# print percentage of missing values in the categorical variables in training set\n",
    "0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "workclass 0.055984555984555984\n",
      "occupation 0.05607230607230607\n",
      "native_country 0.018164268164268166\n"
     ]
    }
   ],
   "source": [
    "# print categorical variables with missing data\n",
    "for col in categorical:\n",
    "    if X_train[col].isnull().mean()>0:\n",
    "        print(col, (X_train[col].isnull().mean()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "trusted": true
   },
   "outputs": [],
   "source": [
    "# impute missing categorical variables with most frequent value\n",
    "for df2 in [X_train, X_test]:\n",
    "    df2['workclass'].fillna(X_train['workclass'].mode()[0], inplace=True)\n",
    "    df2['occupation'].fillna(X_train['occupation'].mode()[0], inplace=True)\n",
    "    df2['native_country'].fillna(X_train['native_country'].mode()[0], inplace=True)    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "workclass         0\n",
       "education         0\n",
       "marital_status    0\n",
       "occupation        0\n",
       "relationship      0\n",
       "race              0\n",
       "sex               0\n",
       "native_country    0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check missing values in categorical variables in X_train\n",
    "X_train[categorical].isnull().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "workclass         0\n",
       "education         0\n",
       "marital_status    0\n",
       "occupation        0\n",
       "relationship      0\n",
       "race              0\n",
       "sex               0\n",
       "native_country    0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check missing values in categorical variables in X_test\n",
    "X_test[categorical].isnull().sum()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a final check, I will check for missing values in X_train and X_test."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "age               0\n",
       "workclass         0\n",
       "fnlwgt            0\n",
       "education         0\n",
       "education_num     0\n",
       "marital_status    0\n",
       "occupation        0\n",
       "relationship      0\n",
       "race              0\n",
       "sex               0\n",
       "capital_gain      0\n",
       "capital_loss      0\n",
       "hours_per_week    0\n",
       "native_country    0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check missing values in X_train\n",
    "X_train.isnull().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "age               0\n",
       "workclass         0\n",
       "fnlwgt            0\n",
       "education         0\n",
       "education_num     0\n",
       "marital_status    0\n",
       "occupation        0\n",
       "relationship      0\n",
       "race              0\n",
       "sex               0\n",
       "capital_gain      0\n",
       "capital_loss      0\n",
       "hours_per_week    0\n",
       "native_country    0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check missing values in X_test\n",
    "X_test.isnull().sum()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that there are no missing values in X_train and X_test."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Encode categorical variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['workclass',\n",
       " 'education',\n",
       " 'marital_status',\n",
       " 'occupation',\n",
       " 'relationship',\n",
       " 'race',\n",
       " 'sex',\n",
       " 'native_country']"
      ]
     },
     "execution_count": 80,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# print categorical variables\n",
    "\n",
    "categorical"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "ename": "KeyError",
     "evalue": "\"['workclass' 'education' 'marital_status' 'occupation' 'relationship'\\n 'race' 'sex' 'native_country'] not in index\"",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mKeyError\u001b[0m                                  Traceback (most recent call last)",
      "Cell \u001b[1;32mIn[81], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m X_train[categorical]\u001b[38;5;241m.\u001b[39mhead()\n",
      "File \u001b[1;32me:\\Programs\\Anaconda\\Lib\\site-packages\\pandas\\core\\frame.py:3899\u001b[0m, in \u001b[0;36mDataFrame.__getitem__\u001b[1;34m(self, key)\u001b[0m\n\u001b[0;32m   3897\u001b[0m     \u001b[38;5;28;01mif\u001b[39;00m is_iterator(key):\n\u001b[0;32m   3898\u001b[0m         key \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mlist\u001b[39m(key)\n\u001b[1;32m-> 3899\u001b[0m     indexer \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mcolumns\u001b[38;5;241m.\u001b[39m_get_indexer_strict(key, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mcolumns\u001b[39m\u001b[38;5;124m\"\u001b[39m)[\u001b[38;5;241m1\u001b[39m]\n\u001b[0;32m   3901\u001b[0m \u001b[38;5;66;03m# take() does not accept boolean indexers\u001b[39;00m\n\u001b[0;32m   3902\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mgetattr\u001b[39m(indexer, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mdtype\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;28;01mNone\u001b[39;00m) \u001b[38;5;241m==\u001b[39m \u001b[38;5;28mbool\u001b[39m:\n",
      "File \u001b[1;32me:\\Programs\\Anaconda\\Lib\\site-packages\\pandas\\core\\indexes\\multi.py:2648\u001b[0m, in \u001b[0;36mMultiIndex._get_indexer_strict\u001b[1;34m(self, key, axis_name)\u001b[0m\n\u001b[0;32m   2645\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(keyarr) \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(keyarr[\u001b[38;5;241m0\u001b[39m], \u001b[38;5;28mtuple\u001b[39m):\n\u001b[0;32m   2646\u001b[0m     indexer \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_get_indexer_level_0(keyarr)\n\u001b[1;32m-> 2648\u001b[0m     \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_raise_if_missing(key, indexer, axis_name)\n\u001b[0;32m   2649\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m[indexer], indexer\n\u001b[0;32m   2651\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28msuper\u001b[39m()\u001b[38;5;241m.\u001b[39m_get_indexer_strict(key, axis_name)\n",
      "File \u001b[1;32me:\\Programs\\Anaconda\\Lib\\site-packages\\pandas\\core\\indexes\\multi.py:2666\u001b[0m, in \u001b[0;36mMultiIndex._raise_if_missing\u001b[1;34m(self, key, indexer, axis_name)\u001b[0m\n\u001b[0;32m   2664\u001b[0m cmask \u001b[38;5;241m=\u001b[39m check \u001b[38;5;241m==\u001b[39m \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m1\u001b[39m\n\u001b[0;32m   2665\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m cmask\u001b[38;5;241m.\u001b[39many():\n\u001b[1;32m-> 2666\u001b[0m     \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mkeyarr[cmask]\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m not in index\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m   2667\u001b[0m \u001b[38;5;66;03m# We get here when levels still contain values which are not\u001b[39;00m\n\u001b[0;32m   2668\u001b[0m \u001b[38;5;66;03m# actually in Index anymore\u001b[39;00m\n\u001b[0;32m   2669\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mkeyarr\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m not in index\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
      "\u001b[1;31mKeyError\u001b[0m: \"['workclass' 'education' 'marital_status' 'occupation' 'relationship'\\n 'race' 'sex' 'native_country'] not in index\""
     ]
    }
   ],
   "source": [
    "X_train[categorical].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {
    "trusted": true
   },
   "outputs": [],
   "source": [
    "# import category encoders\n",
    "#pip install category_encoders\n",
    "import category_encoders as ce"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "ename": "ValueError",
     "evalue": "X does not contain the columns listed in cols",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mValueError\u001b[0m                                Traceback (most recent call last)",
      "Cell \u001b[1;32mIn[79], line 23\u001b[0m\n\u001b[0;32m     17\u001b[0m encoder \u001b[38;5;241m=\u001b[39m ce\u001b[38;5;241m.\u001b[39mOneHotEncoder(cols\u001b[38;5;241m=\u001b[39m[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mworkclass\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124meducation\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124mmarital_status\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124moccupation\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124mrelationship\u001b[39m\u001b[38;5;124m'\u001b[39m, \n\u001b[0;32m     18\u001b[0m                                  \u001b[38;5;124m'\u001b[39m\u001b[38;5;124mrace\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124msex\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124mnative_country\u001b[39m\u001b[38;5;124m'\u001b[39m])\n\u001b[0;32m     20\u001b[0m \u001b[38;5;66;03m# encoder = OneHotEncoder(categorical_features =['workclass', 'education', 'marital_status', 'occupation', 'relationship', \u001b[39;00m\n\u001b[0;32m     21\u001b[0m \u001b[38;5;66;03m#                                  'race', 'sex', 'native_country'])\u001b[39;00m\n\u001b[1;32m---> 23\u001b[0m X_train \u001b[38;5;241m=\u001b[39m encoder\u001b[38;5;241m.\u001b[39mfit_transform(X_train)\n\u001b[0;32m     25\u001b[0m X_test \u001b[38;5;241m=\u001b[39m encoder\u001b[38;5;241m.\u001b[39mtransform(X_test)\n",
      "File \u001b[1;32me:\\Programs\\Anaconda\\Lib\\site-packages\\sklearn\\utils\\_set_output.py:316\u001b[0m, in \u001b[0;36m_wrap_method_output.<locals>.wrapped\u001b[1;34m(self, X, *args, **kwargs)\u001b[0m\n\u001b[0;32m    314\u001b[0m \u001b[38;5;129m@wraps\u001b[39m(f)\n\u001b[0;32m    315\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mwrapped\u001b[39m(\u001b[38;5;28mself\u001b[39m, X, \u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs):\n\u001b[1;32m--> 316\u001b[0m     data_to_wrap \u001b[38;5;241m=\u001b[39m f(\u001b[38;5;28mself\u001b[39m, X, \u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n\u001b[0;32m    317\u001b[0m     \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(data_to_wrap, \u001b[38;5;28mtuple\u001b[39m):\n\u001b[0;32m    318\u001b[0m         \u001b[38;5;66;03m# only wrap the first output for cross decomposition\u001b[39;00m\n\u001b[0;32m    319\u001b[0m         return_tuple \u001b[38;5;241m=\u001b[39m (\n\u001b[0;32m    320\u001b[0m             _wrap_data_with_container(method, data_to_wrap[\u001b[38;5;241m0\u001b[39m], X, \u001b[38;5;28mself\u001b[39m),\n\u001b[0;32m    321\u001b[0m             \u001b[38;5;241m*\u001b[39mdata_to_wrap[\u001b[38;5;241m1\u001b[39m:],\n\u001b[0;32m    322\u001b[0m         )\n",
      "File \u001b[1;32me:\\Programs\\Anaconda\\Lib\\site-packages\\sklearn\\base.py:1098\u001b[0m, in \u001b[0;36mTransformerMixin.fit_transform\u001b[1;34m(self, X, y, **fit_params)\u001b[0m\n\u001b[0;32m   1083\u001b[0m         warnings\u001b[38;5;241m.\u001b[39mwarn(\n\u001b[0;32m   1084\u001b[0m             (\n\u001b[0;32m   1085\u001b[0m                 \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mThis object (\u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m\u001b[38;5;18m__class__\u001b[39m\u001b[38;5;241m.\u001b[39m\u001b[38;5;18m__name__\u001b[39m\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m) has a `transform`\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m   (...)\u001b[0m\n\u001b[0;32m   1093\u001b[0m             \u001b[38;5;167;01mUserWarning\u001b[39;00m,\n\u001b[0;32m   1094\u001b[0m         )\n\u001b[0;32m   1096\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m y \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[0;32m   1097\u001b[0m     \u001b[38;5;66;03m# fit method of arity 1 (unsupervised transformation)\u001b[39;00m\n\u001b[1;32m-> 1098\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mfit(X, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mfit_params)\u001b[38;5;241m.\u001b[39mtransform(X)\n\u001b[0;32m   1099\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m   1100\u001b[0m     \u001b[38;5;66;03m# fit method of arity 2 (supervised transformation)\u001b[39;00m\n\u001b[0;32m   1101\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mfit(X, y, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mfit_params)\u001b[38;5;241m.\u001b[39mtransform(X)\n",
      "File \u001b[1;32me:\\Programs\\Anaconda\\Lib\\site-packages\\category_encoders\\utils.py:306\u001b[0m, in \u001b[0;36mBaseEncoder.fit\u001b[1;34m(self, X, y, **kwargs)\u001b[0m\n\u001b[0;32m    303\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_determine_fit_columns(X)\n\u001b[0;32m    305\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28mset\u001b[39m(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mcols)\u001b[38;5;241m.\u001b[39missubset(X\u001b[38;5;241m.\u001b[39mcolumns):\n\u001b[1;32m--> 306\u001b[0m     \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mX does not contain the columns listed in cols\u001b[39m\u001b[38;5;124m'\u001b[39m)\n\u001b[0;32m    308\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mhandle_missing \u001b[38;5;241m==\u001b[39m \u001b[38;5;124m'\u001b[39m\u001b[38;5;124merror\u001b[39m\u001b[38;5;124m'\u001b[39m:\n\u001b[0;32m    309\u001b[0m     \u001b[38;5;28;01mif\u001b[39;00m X[\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mcols]\u001b[38;5;241m.\u001b[39misna()\u001b[38;5;241m.\u001b[39many()\u001b[38;5;241m.\u001b[39many():\n",
      "\u001b[1;31mValueError\u001b[0m: X does not contain the columns listed in cols"
     ]
    }
   ],
   "source": [
    "# encode remaining variables with one-hot encoding\n",
    "# from sklearn.preprocessing import OneHotEncoder\n",
    "'''\n",
    "将categorical的属性按照类别数目构造成二分类的属性列\n",
    "比如workclass有8个不同的属性值\n",
    "Private             22696\n",
    "Self-emp-not-inc     2541\n",
    "Local-gov            2093\n",
    "State-gov            1298\n",
    "Self-emp-inc         1116\n",
    "Federal-gov           960\n",
    "Without-pay            14\n",
    "Never-worked            7\n",
    "则第一列表示为属性Private是否为0或为1\n",
    "第二列表示为属性Self-emp-not-inc是否为0或为1\n",
    "'''\n",
    "encoder = ce.OneHotEncoder(cols=['workclass', 'education', 'marital_status', 'occupation', 'relationship', \n",
    "                                 'race', 'sex', 'native_country'])\n",
    "\n",
    "# encoder = OneHotEncoder(categorical_features =['workclass', 'education', 'marital_status', 'occupation', 'relationship', \n",
    "#                                  'race', 'sex', 'native_country'])\n",
    "\n",
    "X_train = encoder.fit_transform(X_train)\n",
    "\n",
    "X_test = encoder.transform(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>workclass_1</th>\n",
       "      <th>workclass_2</th>\n",
       "      <th>workclass_3</th>\n",
       "      <th>workclass_4</th>\n",
       "      <th>workclass_5</th>\n",
       "      <th>workclass_6</th>\n",
       "      <th>workclass_7</th>\n",
       "      <th>workclass_8</th>\n",
       "      <th>fnlwgt</th>\n",
       "      <th>...</th>\n",
       "      <th>native_country_32</th>\n",
       "      <th>native_country_33</th>\n",
       "      <th>native_country_34</th>\n",
       "      <th>native_country_35</th>\n",
       "      <th>native_country_36</th>\n",
       "      <th>native_country_37</th>\n",
       "      <th>native_country_38</th>\n",
       "      <th>native_country_39</th>\n",
       "      <th>native_country_40</th>\n",
       "      <th>native_country_41</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>32098</th>\n",
       "      <td>45</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>170871</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25206</th>\n",
       "      <td>47</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>108890</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23491</th>\n",
       "      <td>48</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>187505</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12367</th>\n",
       "      <td>29</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>145592</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7054</th>\n",
       "      <td>23</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>203003</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 105 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       age  workclass_1  workclass_2  workclass_3  workclass_4  workclass_5  \\\n",
       "32098   45            1            0            0            0            0   \n",
       "25206   47            0            1            0            0            0   \n",
       "23491   48            1            0            0            0            0   \n",
       "12367   29            1            0            0            0            0   \n",
       "7054    23            1            0            0            0            0   \n",
       "\n",
       "       workclass_6  workclass_7  workclass_8  fnlwgt  ...  native_country_32  \\\n",
       "32098            0            0            0  170871  ...                  0   \n",
       "25206            0            0            0  108890  ...                  0   \n",
       "23491            0            0            0  187505  ...                  0   \n",
       "12367            0            0            0  145592  ...                  0   \n",
       "7054             0            0            0  203003  ...                  0   \n",
       "\n",
       "       native_country_33  native_country_34  native_country_35  \\\n",
       "32098                  0                  0                  0   \n",
       "25206                  0                  0                  0   \n",
       "23491                  0                  0                  0   \n",
       "12367                  0                  0                  0   \n",
       "7054                   0                  0                  0   \n",
       "\n",
       "       native_country_36  native_country_37  native_country_38  \\\n",
       "32098                  0                  0                  0   \n",
       "25206                  0                  0                  0   \n",
       "23491                  0                  0                  0   \n",
       "12367                  0                  0                  0   \n",
       "7054                   0                  0                  0   \n",
       "\n",
       "       native_country_39  native_country_40  native_country_41  \n",
       "32098                  0                  0                  0  \n",
       "25206                  0                  0                  0  \n",
       "23491                  0                  0                  0  \n",
       "12367                  0                  0                  0  \n",
       "7054                   0                  0                  0  \n",
       "\n",
       "[5 rows x 105 columns]"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(22792, 105)"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train.shape"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that from the initial 14 columns, we now have 113 columns."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similarly, I will take a look at the `X_test` set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>workclass_1</th>\n",
       "      <th>workclass_2</th>\n",
       "      <th>workclass_3</th>\n",
       "      <th>workclass_4</th>\n",
       "      <th>workclass_5</th>\n",
       "      <th>workclass_6</th>\n",
       "      <th>workclass_7</th>\n",
       "      <th>workclass_8</th>\n",
       "      <th>fnlwgt</th>\n",
       "      <th>...</th>\n",
       "      <th>native_country_32</th>\n",
       "      <th>native_country_33</th>\n",
       "      <th>native_country_34</th>\n",
       "      <th>native_country_35</th>\n",
       "      <th>native_country_36</th>\n",
       "      <th>native_country_37</th>\n",
       "      <th>native_country_38</th>\n",
       "      <th>native_country_39</th>\n",
       "      <th>native_country_40</th>\n",
       "      <th>native_country_41</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>22278</th>\n",
       "      <td>27</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>177119</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8950</th>\n",
       "      <td>27</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>216481</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7838</th>\n",
       "      <td>25</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>256263</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16505</th>\n",
       "      <td>46</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>147640</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19140</th>\n",
       "      <td>45</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>172822</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 105 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       age  workclass_1  workclass_2  workclass_3  workclass_4  workclass_5  \\\n",
       "22278   27            1            0            0            0            0   \n",
       "8950    27            1            0            0            0            0   \n",
       "7838    25            1            0            0            0            0   \n",
       "16505   46            1            0            0            0            0   \n",
       "19140   45            1            0            0            0            0   \n",
       "\n",
       "       workclass_6  workclass_7  workclass_8  fnlwgt  ...  native_country_32  \\\n",
       "22278            0            0            0  177119  ...                  0   \n",
       "8950             0            0            0  216481  ...                  0   \n",
       "7838             0            0            0  256263  ...                  0   \n",
       "16505            0            0            0  147640  ...                  0   \n",
       "19140            0            0            0  172822  ...                  0   \n",
       "\n",
       "       native_country_33  native_country_34  native_country_35  \\\n",
       "22278                  0                  0                  0   \n",
       "8950                   0                  0                  0   \n",
       "7838                   0                  0                  0   \n",
       "16505                  0                  0                  0   \n",
       "19140                  0                  0                  0   \n",
       "\n",
       "       native_country_36  native_country_37  native_country_38  \\\n",
       "22278                  0                  0                  0   \n",
       "8950                   0                  0                  0   \n",
       "7838                   0                  0                  0   \n",
       "16505                  0                  0                  0   \n",
       "19140                  0                  0                  0   \n",
       "\n",
       "       native_country_39  native_country_40  native_country_41  \n",
       "22278                  0                  0                  0  \n",
       "8950                   0                  0                  0  \n",
       "7838                   0                  0                  0  \n",
       "16505                  0                  0                  0  \n",
       "19140                  0                  0                  0  \n",
       "\n",
       "[5 rows x 105 columns]"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_test.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(9769, 105)"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_test.shape"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now have training and testing set ready for model building. Before that, we should map all the feature variables onto the same scale. It is called `feature scaling`. I will do it as follows."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **11. Feature Scaling** <a class=\"anchor\" id=\"11\"></a>\n",
    "\n",
    "[Table of Contents](#0.1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "trusted": true
   },
   "outputs": [],
   "source": [
    "cols = X_train.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "trusted": true
   },
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import RobustScaler\n",
    "\n",
    "scaler = RobustScaler()\n",
    "\n",
    "X_train = scaler.fit_transform(X_train)\n",
    "\n",
    "X_test = scaler.transform(X_test)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "trusted": true
   },
   "outputs": [],
   "source": [
    "X_train = pd.DataFrame(X_train, columns=[cols])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {
    "trusted": true
   },
   "outputs": [],
   "source": [
    "X_test = pd.DataFrame(X_test, columns=[cols])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>workclass_1</th>\n",
       "      <th>workclass_2</th>\n",
       "      <th>workclass_3</th>\n",
       "      <th>workclass_4</th>\n",
       "      <th>workclass_5</th>\n",
       "      <th>workclass_6</th>\n",
       "      <th>workclass_7</th>\n",
       "      <th>workclass_8</th>\n",
       "      <th>fnlwgt</th>\n",
       "      <th>...</th>\n",
       "      <th>native_country_32</th>\n",
       "      <th>native_country_33</th>\n",
       "      <th>native_country_34</th>\n",
       "      <th>native_country_35</th>\n",
       "      <th>native_country_36</th>\n",
       "      <th>native_country_37</th>\n",
       "      <th>native_country_38</th>\n",
       "      <th>native_country_39</th>\n",
       "      <th>native_country_40</th>\n",
       "      <th>native_country_41</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.40</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>-0.058906</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.50</td>\n",
       "      <td>-1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>-0.578076</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.55</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.080425</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-0.40</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>-0.270650</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-0.70</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.210240</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 105 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "    age workclass_1 workclass_2 workclass_3 workclass_4 workclass_5  \\\n",
       "0  0.40         0.0         0.0         0.0         0.0         0.0   \n",
       "1  0.50        -1.0         1.0         0.0         0.0         0.0   \n",
       "2  0.55         0.0         0.0         0.0         0.0         0.0   \n",
       "3 -0.40         0.0         0.0         0.0         0.0         0.0   \n",
       "4 -0.70         0.0         0.0         0.0         0.0         0.0   \n",
       "\n",
       "  workclass_6 workclass_7 workclass_8    fnlwgt  ... native_country_32  \\\n",
       "0         0.0         0.0         0.0 -0.058906  ...               0.0   \n",
       "1         0.0         0.0         0.0 -0.578076  ...               0.0   \n",
       "2         0.0         0.0         0.0  0.080425  ...               0.0   \n",
       "3         0.0         0.0         0.0 -0.270650  ...               0.0   \n",
       "4         0.0         0.0         0.0  0.210240  ...               0.0   \n",
       "\n",
       "  native_country_33 native_country_34 native_country_35 native_country_36  \\\n",
       "0               0.0               0.0               0.0               0.0   \n",
       "1               0.0               0.0               0.0               0.0   \n",
       "2               0.0               0.0               0.0               0.0   \n",
       "3               0.0               0.0               0.0               0.0   \n",
       "4               0.0               0.0               0.0               0.0   \n",
       "\n",
       "  native_country_37 native_country_38 native_country_39 native_country_40  \\\n",
       "0               0.0               0.0               0.0               0.0   \n",
       "1               0.0               0.0               0.0               0.0   \n",
       "2               0.0               0.0               0.0               0.0   \n",
       "3               0.0               0.0               0.0               0.0   \n",
       "4               0.0               0.0               0.0               0.0   \n",
       "\n",
       "  native_country_41  \n",
       "0               0.0  \n",
       "1               0.0  \n",
       "2               0.0  \n",
       "3               0.0  \n",
       "4               0.0  \n",
       "\n",
       "[5 rows x 105 columns]"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train.head()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now have `X_train` dataset ready to be fed into the Gaussian Naive Bayes classifier. I will do it as follows."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **12. Model training** <a class=\"anchor\" id=\"12\"></a>\n",
    "\n",
    "[Table of Contents](#0.1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>#sk-container-id-1 {\n",
       "  /* Definition of color scheme common for light and dark mode */\n",
       "  --sklearn-color-text: black;\n",
       "  --sklearn-color-line: gray;\n",
       "  /* Definition of color scheme for unfitted estimators */\n",
       "  --sklearn-color-unfitted-level-0: #fff5e6;\n",
       "  --sklearn-color-unfitted-level-1: #f6e4d2;\n",
       "  --sklearn-color-unfitted-level-2: #ffe0b3;\n",
       "  --sklearn-color-unfitted-level-3: chocolate;\n",
       "  /* Definition of color scheme for fitted estimators */\n",
       "  --sklearn-color-fitted-level-0: #f0f8ff;\n",
       "  --sklearn-color-fitted-level-1: #d4ebff;\n",
       "  --sklearn-color-fitted-level-2: #b3dbfd;\n",
       "  --sklearn-color-fitted-level-3: cornflowerblue;\n",
       "\n",
       "  /* Specific color for light theme */\n",
       "  --sklearn-color-text-on-default-background: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, black)));\n",
       "  --sklearn-color-background: var(--sg-background-color, var(--theme-background, var(--jp-layout-color0, white)));\n",
       "  --sklearn-color-border-box: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, black)));\n",
       "  --sklearn-color-icon: #696969;\n",
       "\n",
       "  @media (prefers-color-scheme: dark) {\n",
       "    /* Redefinition of color scheme for dark theme */\n",
       "    --sklearn-color-text-on-default-background: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, white)));\n",
       "    --sklearn-color-background: var(--sg-background-color, var(--theme-background, var(--jp-layout-color0, #111)));\n",
       "    --sklearn-color-border-box: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, white)));\n",
       "    --sklearn-color-icon: #878787;\n",
       "  }\n",
       "}\n",
       "\n",
       "#sk-container-id-1 {\n",
       "  color: var(--sklearn-color-text);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 pre {\n",
       "  padding: 0;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 input.sk-hidden--visually {\n",
       "  border: 0;\n",
       "  clip: rect(1px 1px 1px 1px);\n",
       "  clip: rect(1px, 1px, 1px, 1px);\n",
       "  height: 1px;\n",
       "  margin: -1px;\n",
       "  overflow: hidden;\n",
       "  padding: 0;\n",
       "  position: absolute;\n",
       "  width: 1px;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-dashed-wrapped {\n",
       "  border: 1px dashed var(--sklearn-color-line);\n",
       "  margin: 0 0.4em 0.5em 0.4em;\n",
       "  box-sizing: border-box;\n",
       "  padding-bottom: 0.4em;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-container {\n",
       "  /* jupyter's `normalize.less` sets `[hidden] { display: none; }`\n",
       "     but bootstrap.min.css set `[hidden] { display: none !important; }`\n",
       "     so we also need the `!important` here to be able to override the\n",
       "     default hidden behavior on the sphinx rendered scikit-learn.org.\n",
       "     See: https://github.com/scikit-learn/scikit-learn/issues/21755 */\n",
       "  display: inline-block !important;\n",
       "  position: relative;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-text-repr-fallback {\n",
       "  display: none;\n",
       "}\n",
       "\n",
       "div.sk-parallel-item,\n",
       "div.sk-serial,\n",
       "div.sk-item {\n",
       "  /* draw centered vertical line to link estimators */\n",
       "  background-image: linear-gradient(var(--sklearn-color-text-on-default-background), var(--sklearn-color-text-on-default-background));\n",
       "  background-size: 2px 100%;\n",
       "  background-repeat: no-repeat;\n",
       "  background-position: center center;\n",
       "}\n",
       "\n",
       "/* Parallel-specific style estimator block */\n",
       "\n",
       "#sk-container-id-1 div.sk-parallel-item::after {\n",
       "  content: \"\";\n",
       "  width: 100%;\n",
       "  border-bottom: 2px solid var(--sklearn-color-text-on-default-background);\n",
       "  flex-grow: 1;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-parallel {\n",
       "  display: flex;\n",
       "  align-items: stretch;\n",
       "  justify-content: center;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "  position: relative;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-parallel-item {\n",
       "  display: flex;\n",
       "  flex-direction: column;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-parallel-item:first-child::after {\n",
       "  align-self: flex-end;\n",
       "  width: 50%;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-parallel-item:last-child::after {\n",
       "  align-self: flex-start;\n",
       "  width: 50%;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-parallel-item:only-child::after {\n",
       "  width: 0;\n",
       "}\n",
       "\n",
       "/* Serial-specific style estimator block */\n",
       "\n",
       "#sk-container-id-1 div.sk-serial {\n",
       "  display: flex;\n",
       "  flex-direction: column;\n",
       "  align-items: center;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "  padding-right: 1em;\n",
       "  padding-left: 1em;\n",
       "}\n",
       "\n",
       "\n",
       "/* Toggleable style: style used for estimator/Pipeline/ColumnTransformer box that is\n",
       "clickable and can be expanded/collapsed.\n",
       "- Pipeline and ColumnTransformer use this feature and define the default style\n",
       "- Estimators will overwrite some part of the style using the `sk-estimator` class\n",
       "*/\n",
       "\n",
       "/* Pipeline and ColumnTransformer style (default) */\n",
       "\n",
       "#sk-container-id-1 div.sk-toggleable {\n",
       "  /* Default theme specific background. It is overwritten whether we have a\n",
       "  specific estimator or a Pipeline/ColumnTransformer */\n",
       "  background-color: var(--sklearn-color-background);\n",
       "}\n",
       "\n",
       "/* Toggleable label */\n",
       "#sk-container-id-1 label.sk-toggleable__label {\n",
       "  cursor: pointer;\n",
       "  display: block;\n",
       "  width: 100%;\n",
       "  margin-bottom: 0;\n",
       "  padding: 0.5em;\n",
       "  box-sizing: border-box;\n",
       "  text-align: center;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 label.sk-toggleable__label-arrow:before {\n",
       "  /* Arrow on the left of the label */\n",
       "  content: \"▸\";\n",
       "  float: left;\n",
       "  margin-right: 0.25em;\n",
       "  color: var(--sklearn-color-icon);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {\n",
       "  color: var(--sklearn-color-text);\n",
       "}\n",
       "\n",
       "/* Toggleable content - dropdown */\n",
       "\n",
       "#sk-container-id-1 div.sk-toggleable__content {\n",
       "  max-height: 0;\n",
       "  max-width: 0;\n",
       "  overflow: hidden;\n",
       "  text-align: left;\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-toggleable__content.fitted {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-toggleable__content pre {\n",
       "  margin: 0.2em;\n",
       "  border-radius: 0.25em;\n",
       "  color: var(--sklearn-color-text);\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-toggleable__content.fitted pre {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content {\n",
       "  /* Expand drop-down */\n",
       "  max-height: 200px;\n",
       "  max-width: 100%;\n",
       "  overflow: auto;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {\n",
       "  content: \"▾\";\n",
       "}\n",
       "\n",
       "/* Pipeline/ColumnTransformer-specific style */\n",
       "\n",
       "#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
       "  color: var(--sklearn-color-text);\n",
       "  background-color: var(--sklearn-color-unfitted-level-2);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-label.fitted input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
       "  background-color: var(--sklearn-color-fitted-level-2);\n",
       "}\n",
       "\n",
       "/* Estimator-specific style */\n",
       "\n",
       "/* Colorize estimator box */\n",
       "#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-2);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-estimator.fitted input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-2);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-label label.sk-toggleable__label,\n",
       "#sk-container-id-1 div.sk-label label {\n",
       "  /* The background is the default theme color */\n",
       "  color: var(--sklearn-color-text-on-default-background);\n",
       "}\n",
       "\n",
       "/* On hover, darken the color of the background */\n",
       "#sk-container-id-1 div.sk-label:hover label.sk-toggleable__label {\n",
       "  color: var(--sklearn-color-text);\n",
       "  background-color: var(--sklearn-color-unfitted-level-2);\n",
       "}\n",
       "\n",
       "/* Label box, darken color on hover, fitted */\n",
       "#sk-container-id-1 div.sk-label.fitted:hover label.sk-toggleable__label.fitted {\n",
       "  color: var(--sklearn-color-text);\n",
       "  background-color: var(--sklearn-color-fitted-level-2);\n",
       "}\n",
       "\n",
       "/* Estimator label */\n",
       "\n",
       "#sk-container-id-1 div.sk-label label {\n",
       "  font-family: monospace;\n",
       "  font-weight: bold;\n",
       "  display: inline-block;\n",
       "  line-height: 1.2em;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-label-container {\n",
       "  text-align: center;\n",
       "}\n",
       "\n",
       "/* Estimator-specific */\n",
       "#sk-container-id-1 div.sk-estimator {\n",
       "  font-family: monospace;\n",
       "  border: 1px dotted var(--sklearn-color-border-box);\n",
       "  border-radius: 0.25em;\n",
       "  box-sizing: border-box;\n",
       "  margin-bottom: 0.5em;\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-estimator.fitted {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-0);\n",
       "}\n",
       "\n",
       "/* on hover */\n",
       "#sk-container-id-1 div.sk-estimator:hover {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-2);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-estimator.fitted:hover {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-2);\n",
       "}\n",
       "\n",
       "/* Specification for estimator info (e.g. \"i\" and \"?\") */\n",
       "\n",
       "/* Common style for \"i\" and \"?\" */\n",
       "\n",
       ".sk-estimator-doc-link,\n",
       "a:link.sk-estimator-doc-link,\n",
       "a:visited.sk-estimator-doc-link {\n",
       "  float: right;\n",
       "  font-size: smaller;\n",
       "  line-height: 1em;\n",
       "  font-family: monospace;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "  border-radius: 1em;\n",
       "  height: 1em;\n",
       "  width: 1em;\n",
       "  text-decoration: none !important;\n",
       "  margin-left: 1ex;\n",
       "  /* unfitted */\n",
       "  border: var(--sklearn-color-unfitted-level-1) 1pt solid;\n",
       "  color: var(--sklearn-color-unfitted-level-1);\n",
       "}\n",
       "\n",
       ".sk-estimator-doc-link.fitted,\n",
       "a:link.sk-estimator-doc-link.fitted,\n",
       "a:visited.sk-estimator-doc-link.fitted {\n",
       "  /* fitted */\n",
       "  border: var(--sklearn-color-fitted-level-1) 1pt solid;\n",
       "  color: var(--sklearn-color-fitted-level-1);\n",
       "}\n",
       "\n",
       "/* On hover */\n",
       "div.sk-estimator:hover .sk-estimator-doc-link:hover,\n",
       ".sk-estimator-doc-link:hover,\n",
       "div.sk-label-container:hover .sk-estimator-doc-link:hover,\n",
       ".sk-estimator-doc-link:hover {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-3);\n",
       "  color: var(--sklearn-color-background);\n",
       "  text-decoration: none;\n",
       "}\n",
       "\n",
       "div.sk-estimator.fitted:hover .sk-estimator-doc-link.fitted:hover,\n",
       ".sk-estimator-doc-link.fitted:hover,\n",
       "div.sk-label-container:hover .sk-estimator-doc-link.fitted:hover,\n",
       ".sk-estimator-doc-link.fitted:hover {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-3);\n",
       "  color: var(--sklearn-color-background);\n",
       "  text-decoration: none;\n",
       "}\n",
       "\n",
       "/* Span, style for the box shown on hovering the info icon */\n",
       ".sk-estimator-doc-link span {\n",
       "  display: none;\n",
       "  z-index: 9999;\n",
       "  position: relative;\n",
       "  font-weight: normal;\n",
       "  right: .2ex;\n",
       "  padding: .5ex;\n",
       "  margin: .5ex;\n",
       "  width: min-content;\n",
       "  min-width: 20ex;\n",
       "  max-width: 50ex;\n",
       "  color: var(--sklearn-color-text);\n",
       "  box-shadow: 2pt 2pt 4pt #999;\n",
       "  /* unfitted */\n",
       "  background: var(--sklearn-color-unfitted-level-0);\n",
       "  border: .5pt solid var(--sklearn-color-unfitted-level-3);\n",
       "}\n",
       "\n",
       ".sk-estimator-doc-link.fitted span {\n",
       "  /* fitted */\n",
       "  background: var(--sklearn-color-fitted-level-0);\n",
       "  border: var(--sklearn-color-fitted-level-3);\n",
       "}\n",
       "\n",
       ".sk-estimator-doc-link:hover span {\n",
       "  display: block;\n",
       "}\n",
       "\n",
       "/* \"?\"-specific style due to the `<a>` HTML tag */\n",
       "\n",
       "#sk-container-id-1 a.estimator_doc_link {\n",
       "  float: right;\n",
       "  font-size: 1rem;\n",
       "  line-height: 1em;\n",
       "  font-family: monospace;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "  border-radius: 1rem;\n",
       "  height: 1rem;\n",
       "  width: 1rem;\n",
       "  text-decoration: none;\n",
       "  /* unfitted */\n",
       "  color: var(--sklearn-color-unfitted-level-1);\n",
       "  border: var(--sklearn-color-unfitted-level-1) 1pt solid;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 a.estimator_doc_link.fitted {\n",
       "  /* fitted */\n",
       "  border: var(--sklearn-color-fitted-level-1) 1pt solid;\n",
       "  color: var(--sklearn-color-fitted-level-1);\n",
       "}\n",
       "\n",
       "/* On hover */\n",
       "#sk-container-id-1 a.estimator_doc_link:hover {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-3);\n",
       "  color: var(--sklearn-color-background);\n",
       "  text-decoration: none;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 a.estimator_doc_link.fitted:hover {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-3);\n",
       "}\n",
       "</style><div id=\"sk-container-id-1\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>GaussianNB()</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item\"><div class=\"sk-estimator fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-1\" type=\"checkbox\" checked><label for=\"sk-estimator-id-1\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\">&nbsp;&nbsp;GaussianNB<a class=\"sk-estimator-doc-link fitted\" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.5/modules/generated/sklearn.naive_bayes.GaussianNB.html\">?<span>Documentation for GaussianNB</span></a><span class=\"sk-estimator-doc-link fitted\">i<span>Fitted</span></span></label><div class=\"sk-toggleable__content fitted\"><pre>GaussianNB()</pre></div> </div></div></div></div>"
      ],
      "text/plain": [
       "GaussianNB()"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# train a Gaussian Naive Bayes classifier on the training set\n",
    "from sklearn.naive_bayes import GaussianNB\n",
    "\n",
    "# instantiate the model\n",
    "gnb = GaussianNB()\n",
    "\n",
    "# fit the model\n",
    "gnb.fit(X_train, y_train)\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **13. Predict the results** <a class=\"anchor\" id=\"13\"></a>\n",
    "\n",
    "[Table of Contents](#0.1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['<=50K', '<=50K', '>50K', ..., '>50K', '<=50K', '<=50K'],\n",
       "      dtype='<U5')"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y_pred = gnb.predict(X_test)\n",
    "\n",
    "y_pred"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **14. Check accuracy score** <a class=\"anchor\" id=\"14\"></a>\n",
    "\n",
    "[Table of Contents](#0.1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Model accuracy score: 0.8083\n"
     ]
    }
   ],
   "source": [
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "print('Model accuracy score: {0:0.4f}'. format(accuracy_score(y_test, y_pred)))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, **y_test** are the true class labels and **y_pred** are the predicted class labels in the test-set."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Compare the train-set and test-set accuracy\n",
    "\n",
    "\n",
    "Now, I will compare the train-set and test-set accuracy to check for overfitting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['>50K', '<=50K', '>50K', ..., '<=50K', '>50K', '<=50K'],\n",
       "      dtype='<U5')"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y_pred_train = gnb.predict(X_train)\n",
    "\n",
    "y_pred_train"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training-set accuracy score: 0.8067\n"
     ]
    }
   ],
   "source": [
    "print('Training-set accuracy score: {0:0.4f}'. format(accuracy_score(y_train, y_pred_train)))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Check for overfitting and underfitting"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training set score: 0.8067\n",
      "Test set score: 0.8083\n"
     ]
    }
   ],
   "source": [
    "# print the scores on training and test set\n",
    "\n",
    "print('Training set score: {:.4f}'.format(gnb.score(X_train, y_train)))\n",
    "\n",
    "print('Test set score: {:.4f}'.format(gnb.score(X_test, y_test)))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The training-set accuracy score is 0.8067 while the test-set accuracy to be 0.8083. These two values are quite comparable. So, there is no sign of overfitting. \n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Compare model accuracy with null accuracy\n",
    "\n",
    "\n",
    "So, the model accuracy is 0.8083. But, we cannot say that our model is very good based on the above accuracy. We must compare it with the **null accuracy**. Null accuracy is the accuracy that could be achieved by always predicting the most frequent class.\n",
    "\n",
    "So, we should first check the class distribution in the test set. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "income\n",
       "<=50K    7407\n",
       ">50K     2362\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check class distribution in test set\n",
    "\n",
    "y_test.value_counts()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that the occurences of most frequent class is 7407. So, we can calculate null accuracy by dividing 7407 by total number of occurences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Null accuracy score: 0.7582\n"
     ]
    }
   ],
   "source": [
    "# check null accuracy score\n",
    "\n",
    "null_accuracy = (7407/(7407+2362))\n",
    "\n",
    "print('Null accuracy score: {0:0.4f}'. format(null_accuracy))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that our model accuracy score is 0.8083 but null accuracy score is 0.7582. So, we can conclude that our Gaussian Naive Bayes Classification model is doing a very good job in predicting the class labels."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, based on the above analysis we can conclude that our classification model accuracy is very good. Our model is doing a very good job in terms of predicting the class labels.\n",
    "\n",
    "\n",
    "But, it does not give the underlying distribution of values. Also, it does not tell anything about the type of errors our classifer is making. \n",
    "\n",
    "\n",
    "We have another tool called `Confusion matrix` that comes to our rescue."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **15. Confusion matrix** <a class=\"anchor\" id=\"15\"></a>\n",
    "\n",
    "[Table of Contents](#0.1)\n",
    "\n",
    "\n",
    "A confusion matrix is a tool for summarizing the performance of a classification algorithm. A confusion matrix will give us a clear picture of classification model performance and the types of errors produced by the model. It gives us a summary of correct and incorrect predictions broken down by each category. The summary is represented in a tabular form.\n",
    "\n",
    "\n",
    "Four types of outcomes are possible while evaluating a classification model performance. These four outcomes are described below:-\n",
    "\n",
    "\n",
    "**True Positives (TP)** – True Positives occur when we predict an observation belongs to a certain class and the observation actually belongs to that class.\n",
    "\n",
    "\n",
    "**True Negatives (TN)** – True Negatives occur when we predict an observation does not belong to a certain class and the observation actually does not belong to that class.\n",
    "\n",
    "\n",
    "**False Positives (FP)** – False Positives occur when we predict an observation belongs to a    certain class but the observation actually does not belong to that class. This type of error is called **Type I error.**\n",
    "\n",
    "\n",
    "\n",
    "**False Negatives (FN)** – False Negatives occur when we predict an observation does not belong to a certain class but the observation actually belongs to that class. This is a very serious error and it is called **Type II error.**\n",
    "\n",
    "\n",
    "\n",
    "These four outcomes are summarized in a confusion matrix given below.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Confusion matrix\n",
      "\n",
      " [[5999 1408]\n",
      " [ 465 1897]]\n",
      "\n",
      "True Positives(TP) =  5999\n",
      "\n",
      "True Negatives(TN) =  1897\n",
      "\n",
      "False Positives(FP) =  1408\n",
      "\n",
      "False Negatives(FN) =  465\n"
     ]
    }
   ],
   "source": [
    "# Print the Confusion Matrix and slice it into four pieces\n",
    "\n",
    "from sklearn.metrics import confusion_matrix\n",
    "\n",
    "cm = confusion_matrix(y_test, y_pred)\n",
    "\n",
    "print('Confusion matrix\\n\\n', cm)\n",
    "\n",
    "print('\\nTrue Positives(TP) = ', cm[0,0])\n",
    "\n",
    "print('\\nTrue Negatives(TN) = ', cm[1,1])\n",
    "\n",
    "print('\\nFalse Positives(FP) = ', cm[0,1])\n",
    "\n",
    "print('\\nFalse Negatives(FN) = ', cm[1,0])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The confusion matrix shows `5999 + 1897 = 7896 correct predictions` and `1408 + 465 = 1873 incorrect predictions`.\n",
    "\n",
    "\n",
    "In this case, we have\n",
    "\n",
    "\n",
    "- `True Positives` (Actual Positive:1 and Predict Positive:1) - 5999\n",
    "\n",
    "\n",
    "- `True Negatives` (Actual Negative:0 and Predict Negative:0) - 1897\n",
    "\n",
    "\n",
    "- `False Positives` (Actual Negative:0 but Predict Positive:1) - 1408 `(Type I error)`\n",
    "\n",
    "\n",
    "- `False Negatives` (Actual Positive:1 but Predict Negative:0) - 465 `(Type II error)`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Axes: >"
      ]
     },
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAhAAAAGdCAYAAABDxkoSAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjEsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvc2/+5QAAAAlwSFlzAAAPYQAAD2EBqD+naQAAR5xJREFUeJzt3XlYVGX/P/D3sA0IAooCmogYuWD4JJKKmaaiqJialKmklJRpuOIWZWqZgpa5ZGVpiZYWWmoqlQsKaeASCqIk7pELuCKCCgzcvz/8cb7OjDZz8OCM0/v1XOe6hvvc557PnKeRD/d2VEIIASIiIiIZrEwdABERET16mEAQERGRbEwgiIiISDYmEERERCQbEwgiIiKSjQkEERERycYEgoiIiGRjAkFERESyMYEgIiIi2WxMHUAlh4aDTB0Ckdk5mj3Y1CEQmSVvp+ertX0lfyfdyv1esbbMidkkEEREROZCpWIHvSG8Q0RERCQbeyCIiIh0qPj3tUFMIIiIiHRwCMMwJhBEREQ6mEAYxjtEREREsrEHgoiISIdKpTJ1CGaPCQQREZEedtAbwjtEREREsrEHgoiISAcnURrGBIKIiEgHEwjDeIeIiIhINvZAEBER6eBOlIYxgSAiItLBIQzDeIeIiIhINvZAEBER6WAPhGFMIIiIiHQwgTCMCQQREZEOFbiVtSFMsYiIiEg29kAQERHp4BCGYUwgiIiIdDCBMIx3iIiIiGRjDwQREZEO9kAYxgSCiIhIDxMIQ3iHiIiISDb2QBAREengEIZhTCCIiIh0MIEwjHeIiIiIZGMPBBERkQ4V/742iAkEERGRDg5hGMYEgoiISIdKxYdpGcIUi4iIiGRjDwQREZEODmEYptgd0mg0yM3NVao5IiIik1HBSrHDUin2yY4cOQIfHx+lmiMiIiIzxiEMIiIiHRzCMMzoBCIgIOBfz9+6deuBgyEiIjIHTCAMMzqByM7OxsCBA+87THHhwgUcO3ZMscCIiIjIfBmdQDz55JNo27YtRo4cec/zGRkZWLp0qWKBERERmYolT35UitEJxDPPPIOcnJz7nq9ZsyY6duyoSFBEREQmxSEMg4xOIBYuXPiv5x9//HHs3LnzgQMiIiIi88dVGERERDo4idIwJhBEREQ6+CwMw6qUYvn4+KBbt25aZcHBwWjcuLEiQREREZkSd6I0rEo9EBEREahbt65W2QsvvIDLly8rEhQRERGZtyolEDNmzNAri4qKetBYiIiIzALnQBhW5TtUWlqKnJwcaDQaJeMhIiIyPZVKucNCyU4gbt68icjISNSoUQMtWrSQnsA5evRoxMXFKR4gERERmR/ZCURMTAwyMzORnJwMe3t7qTw4OBgJCQmKBkdERGQSVgoeFkr2HIgNGzYgISEB7dq101rm0qJFC5w8eVLR4IiIiEzCgocelCI7N7p06RLc3d31youLi7luloiI6D9CdgIRGBiIxMRE6efKpGHZsmUICgpSLjIiIiJT4SRKg2QPYcyePRs9e/ZEdnY2NBoNFi5ciOzsbKSmpiIlJaU6YiQiInq4LHjuglJk36IOHTogIyMDGo0G/v7+2Lp1K9zd3ZGWlobWrVtXR4xERERkZqq0kdTjjz+OpUuXKh0LERGRWRAWPPSgFNk9EMHBwYiPj0dhYWF1xENERGR6KgUPCyU7gWjRogViYmLg6emJl156CT///DPKysqqIzYiIiLTsFIpd1go2QnEwoULce7cOWzYsAGOjo4YOnQoPDw8MHz4cE6iJCIi+o+o0jxTKysrdO/eHfHx8cjPz8eXX36Jffv2oUuXLkrHR0RE9PBxGadBVZpEWSkvLw8//PADvvvuOxw6dAht2rRRKi4iIiLTsdzf+4qR3QNRWFiI5cuXo1u3bvDy8sIXX3yBPn364Pjx49izZ091xEhERERmRnYC4eHhgXfffRdPPvkk0tLSkJOTg2nTpuHxxx+vjviIiIgePhNNopwxYwZUKpXW0axZM+n87du3ERUVBTc3Nzg5OSEsLAz5+flabeTm5iI0NBQ1atSAu7s7Jk2aBI1Go1UnOTkZAQEBUKvV8PX1RXx8vOxbJHsIY+PGjejatSusrLhNFxERWSgTzl1o0aIFtm/fLv1sY/N/v6rHjx+PxMRErF27Fi4uLhg1ahT69++PP/74AwBQXl6O0NBQeHp6IjU1FRcuXMDQoUNha2uL2bNnAwBOnz6N0NBQjBgxAqtWrUJSUhJef/111KtXDyEhIUbHKTuB6Natm9xLiIiIyEg2Njbw9PTUK79+/Tq+/vprrF69Wlq0sHz5cjRv3hx79uxBu3btsHXrVmRnZ2P79u3w8PDAU089hZkzZ2LKlCmYMWMG7OzssGTJEvj4+GDevHkAgObNm2P37t2YP3++rATCqG6EgIAAXLt2DQDQqlUrBAQE3PcgIiJ65Cm4kVRJSQkKCwu1jpKSkvu+9fHjx1G/fn00btwY4eHhyM3NBQCkp6ejrKwMwcHBUt1mzZqhYcOGSEtLAwCkpaXB398fHh4eUp2QkBAUFhbiyJEjUp2726isU9mGsYzqgejbty/UarX0mo/tJiIii6bgBlCxsbF4//33tcqmT5+OGTNm6NVt27Yt4uPj0bRpU1y4cAHvv/8+nn32WRw+fBh5eXmws7ODq6ur1jUeHh7Iy8sDcGd15N3JQ+X5ynP/VqewsBC3bt2Cg4ODUZ/LqARi+vTp0ut7fWAiIiK6t5iYGERHR2uVVf5Rrqtnz57S65YtW6Jt27bw9vbGmjVrjP7F/rDIngnZuHFjXLlyRa+8oKAAjRs3ViQoIiIik1JwCEOtVsPZ2VnruF8CocvV1RVNmjTBiRMn4OnpidLSUhQUFGjVyc/Pl+ZMeHp66q3KqPzZUB1nZ2dZSYrsBOLMmTMoLy/XKy8pKcHZs2flNkdERGR2hEql2PEgioqKcPLkSdSrVw+tW7eGra0tkpKSpPM5OTnIzc1FUFAQACAoKAhZWVm4ePGiVGfbtm1wdnaGn5+fVOfuNirrVLZhLKNXYWzcuFF6vWXLFri4uEg/l5eXIykpCT4+PrLenIiIyCyZ6CFYEydOxPPPPw9vb2+cP38e06dPh7W1NQYNGgQXFxdERkYiOjoatWvXhrOzM0aPHo2goCC0a9cOANC9e3f4+flhyJAhmDt3LvLy8jB16lRERUVJvR4jRozA4sWLMXnyZAwbNgw7duzAmjVrkJiYKCtWoxOIfv36AQBUKhUiIiK0ztna2qJRo0bSkhAiIiKS7+zZsxg0aBCuXLmCunXrokOHDtizZw/q1q0LAJg/fz6srKwQFhaGkpIShISE4PPPP5eut7a2xubNmzFy5EgEBQXB0dERERER+OCDD6Q6Pj4+SExMxPjx47Fw4UI0aNAAy5Ytk7WEEwBUQggh5wIfHx/s378fderUkfVGhjg0HKRoe0SW4Gj2YFOHQGSWvJ2er9b2fZ+PV6ytE5teVawtcyJ7I6nTp09XRxxERETmg9sVGGRUArFo0SIMHz4c9vb2WLRo0b/WHTNmjCKBERERkfkyKoGYP38+wsPDYW9vj/nz59+3nkqlYgJBRESPPhNNonyUGJVA3D1swSEMIiKyeMwfDHrgR2qWl5cjIyNDelYGERERWT7ZCcS4cePw9ddfA7iTPHTs2BEBAQHw8vJCcnKy0vERERE9fCqVcoeFkp1A/Pjjj/jf//4HANi0aRPOnDmDo0ePYvz48Xj33XcVD5CIiOihYwJhkOwE4vLly9J+2r/88gteeuklNGnSBMOGDUNWVpbiARIREZH5kZ1AeHh4IDs7G+Xl5fjtt9/QrVs3AMDNmzdhbW2teIBEREQPnZWCh4WSvZHUa6+9hgEDBqBevXpQqVQIDg4GAOzduxfNmjVTPEAiIqKHzoKHHpQiO4GYMWMGnnzySfzzzz946aWXpIdzWFtb4+2331Y8QCIiooeO+YNBshMIAHjxxRf1ynQfsEVERESWq0qjMykpKXj++efh6+sLX19f9OnTB7t27VI6NiIiIpMQVirFDksluwfiu+++w2uvvYb+/ftL21b/8ccf6Nq1K+Lj4zF4MJ8eaErvjg/D1PHaPUQ5J87hqS4TAQA+3u6Ie/cVBD3dFGo7G2xLOYToafG4ePm6VP+pJxvhw5jBaN2yMcorKrDh132Y8sG3KL5ZItV57pkWmD5hAFo080LxzRKs+ul3TJ+bgPLyiofzQYlkOnTgJNauTMbxv87h6uVCTP/4VTzT+cl71l04+0ck/rQHIyb0Qf/BHaXywus38dnc9di7KxsqlQodurbEWxP7wqGGWqrzZ2oOVn65BX+fyoednQ38Axpj+Pjn4Vm/drV/RlIQ50AYJLsHYtasWZg7dy4SEhIwZswYjBkzBgkJCYiLi8PMmTOrI0aS6UjOP2jUeoR0dA17HwBQw0GNzd+9AyEEeg78EF36z4CdrTV++mYiVP//y1LPoxYSV7+Lk2fy0LHve+g7JA5+TRpg6Scjpfb9mzfEhvgp2JqSiXY9YzAkahFCg1vjw7f5SHYyX7dvlaJxk/oYNeWFf623e0cW/srKhVtdZ71zcVNX4e9T+Yj9bDhmLohE1oFTWPDhj9L5C+euYPqE5XjqaV98sXo8Zi9+A9cLivHBxBWKfx4iU5OdQJw6dQrPP6//HPY+ffrwORlmQqMpR/6l69Jx5doNAEBQYBN4N6iLNyYswZGcf3Ak5x+8Hv0FAlo2xnPPtAAA9OzaCmVl5Rg3dTmOn7qA9EOnMDrma7zQqy0ae3sAAF58PgiHj+YiduE6nPo7H7v3/oV3Y1fjzYjucHK0N9nnJvo3bZ5pjtfe6okOXfzvW+fyxev4/KMNePvDwbCx0V6Wnns6H3+m5iD6vZfQ3N8bT7byQdTkfkjemoErl+704B3/6ywqyivw6ls9UN+rDp5o3gAvDumEk8fOQ1NWXq2fjxSmUvCwULITCC8vLyQlJemVb9++HV5eXooERQ/G18cTp/Z/juzdC7B8YRS86rsBANRqWwghUFJaJtW9XVKGigqB9k83vVPHzhZlZRoIIaQ6t26XAoBWndsl/9dGZR0Hezu08vep1s9GVF0qKiow573VeGnIc2j0uKfe+exDf8OppgOa+P3fv3MBbZ6AykqFv7JyAQBPNG8AKysVtmzcj/LyChTfuIWkxHS0avMEbGy5T84jxUql3GGhZCcQEyZMwJgxYzBy5Eh8++23+PbbbzFixAiMGzcOEydOrI4YSYb9B09g+IQl6DMkDmPe+QaNvNyx/cfpcHK0x74Dx1F8swSzYgbDwd4ONRzUiHv3FdjYWMPT3RUAkJx6BB51XTD+zd6wtbWGq4sjPoy5MzTh6VELALAtJRPtWjfBgD7tYWWlQn2PWnhnbH8AQD33Wib53EQPKiF+J6ytrdFvUId7nr925QZcaztplVnbWKOmswOuXbnTy1fvMTfEfjYcyz/7FaFBb+OF597D5YvXMXXOkGqPn+hhk51AjBw5Ej/88AOysrIwbtw4jBs3DocPH0ZCQgLefPNNo9ooKSlBYWGh1iEEu/eUsDU5E+sS9+Lw0Vxs//0Q+r06By7Ojgjr3Q6Xr95A+MgF6BUcgMtHlyP/yNdwcamBA1mnUFFxp8fhr2Nn8Ub0FxjzRiiu5qzAmT+/wJnci8i7WABRcWeCZNKuLLwzaxUWzY7E9RPf4lDKJ9iyMwMAUCE4iZIePcf+OosNP+zGpPdfluYDVcXVy4WY/+FadOsdiMUrx+LjpSNhY2uNmZNXavXq0SOAz8IwSNYqDCEETpw4gSZNmiA5ORk2NlXaRgKxsbF4//33tcqsnVvA1uX+Y5NUNdcLb+LE6Qt4vNGdLtmkXVlo8ew4uNWqCU15Oa4X3sTpP7/Amdw06ZqEn1OR8HMq3Ou4oPjmbQgBjHkjFKdzL0p1Fi37BYuW/YJ6HrVwraAI3l51MfPtQTj990W9GIjM3eGDp1BwtQjhobOksoryCnw1fxPWr96Fbze/i1puNVFwtUjrunJNOW4U3kItt5oAgI1rUuHoZI83xvaW6kyZORjhvT7E0cO5aO7v/XA+ED04y/29rxijM4DTp0+jT58+yM7OBgA0aNAAP/30EwIDA2W/aUxMDKKjo7XK3Fu8LrsdMsyxhho+3h7IW6e9T0flxMpO7VvAvY4zNm9L17u2cmnn0AHP4XZJKZJ26T8s7UL+NQDAgD7t8c+5yzh4mBNp6dET3Ks1WrV5QqvsnVFLEdyrNbr3eRoA4NfSG0U3buHYX2fRpHkDAMDB/ScgKgSa+zcEAJTcLtXrwbCyutPRW9nLR2QpjE4gJk2aBI1Gg++++w729vb4+OOPMXz4cBw4cED2m6rVamkL7EoqFScYKSH23XAkbj+A3HOXUN+jFqZGv4Ty8gqs+TkVADDkpU7IOXEOl64Wom1AE3w8Yyg+XfYrjp+6ILUxIqI79qQfQ1HxbXR91h+z3w3He3Hf43rhTanO+Dd7Y2tyJiqEQN8eT2PiW33xylsL+Y8kma1bN0tw/p/L0s9556/iZM451HSuAfd6teDs6qhV38bGGrXq1IRXI3cAQEMfDwS2b4oFM9dizDthKNeU47O56/Fc96fgVtcFANCmQ3OsW70L3321FZ17tMLN4hIs/+xXeNSrBd+mjz28D0sPzoInPyrF6ARi9+7d+PHHH9Ghw50JRu3atUODBg1QXFwMR0dHA1fTw/JYvdpYuXg0ars64fLVQqTuz0Gnfu/h8tU7PQ5NHq+HD6YMRG1XJ/x99hLmfroBi5b9otVG4FOPY2r0i3CqYY+ck+cxKmYZvl+3W6tO9+eewuRR/aBW2yIr+2+89PrH2Jqc+dA+J5Fcx7L/waQ3l0g/f/nJRgBAt96BmPT+QKPaePvDcHw2Zz2mjPwSKpUKz3b1x1uT+knnW7V5Am/PGoy1K5KxZmUy1Pa28GvZCLM+fQNqe1tFPw9VMyYQBqmEkTN7rKyscOHCBXh4eEhlTk5OyMrKgo/Pgy/dc2jITYiIdB3N5s6uRPfi7aS/H5GSGr++VrG2Ti17SbG2zInRPRAqlQpFRUVwcHCQyqysrHDjxg0UFhZKZc7O+ru3ERERkWUxOoEQQqBJkyZ6Za1atZJeq1QqlJdzOSYRET3iOIRhkNEJxM6dO6szDiIiIvNhwfs3KMXoBKJTp07VGQcRERE9Qqq2ExQREZEl4xCGQUwgiIiIdMl+0MN/D28RERERycYeCCIiIl2cRGmQ7B6IYcOG4caNG3rlxcXFGDZsmCJBERERmZSVSrnDQslOIFasWIFbt27pld+6dQsrV65UJCgiIiIyb0YPYRQWFkIIASEEbty4AXt7e+lceXk5fvnlF7i7u1dLkERERA+T4BCGQUYnEK6urlCpVFCpVHo7UgJ3trp+//33FQ2OiIjIJLjEwCBZO1EKIdClSxf89NNPqF27tnTOzs4O3t7eqF+/frUESURE9FBZ8NwFpcjeifL06dNo2LAhVOzeISIi+s+S3UmzY8cO/Pjjj3rla9euxYoVKxQJioiIyKRUKuUOCyU7gYiNjUWdOnX0yt3d3TF79mxFgiIiIjIpLuM0SHYCkZubCx8fH71yb29v5ObmKhIUERERmTfZCYS7uzsOHTqkV56ZmQk3NzdFgiIiIjIplYKHhZK9lfWgQYMwZswY1KxZEx07dgQApKSkYOzYsRg4cKDiARIRET1swoKHHpQiO4GYOXMmzpw5g65du8LG5s7lFRUVGDp0KOdAEBER/UfITiDs7OyQkJCAmTNnIjMzEw4ODvD394e3t3d1xEdERPTwsQfCoCo/jbNJkyb33JGSiIjokWfByy+VYlQCER0djZkzZ8LR0RHR0dH/WveTTz5RJDAiIiIyX0YlEAcPHkRZWZn0+n64OyUREVkEPgvDIKMSiJ07d97zNRERkUXiH8QGVXkOBBERkcXiJEqDjEog+vfvb3SD69atq3IwRERE9GgwKoFwcXGRXgshsH79eri4uCAwMBAAkJ6ejoKCAlmJBhERkdliD4RBRiUQy5cvl15PmTIFAwYMwJIlS2BtbQ0AKC8vx1tvvQVnZ+fqiZKIiOghEpwDYZDseabffPMNJk6cKCUPAGBtbY3o6Gh88803igZHRERE5kl2AqHRaHD06FG98qNHj6KiokKRoIiIiEzKSsHDQslehfHaa68hMjISJ0+eRJs2bQAAe/fuRVxcHF577TXFAyQiInroOIRhkOzc6OOPP8bkyZMxb948dOzYER07dsQnn3yCSZMm4aOPPqqOGImIiP5z4uLioFKpMG7cOKns9u3biIqKgpubG5ycnBAWFob8/Hyt63JzcxEaGooaNWrA3d0dkyZNgkaj0aqTnJyMgIAAqNVq+Pr6Ij4+XnZ8shMIKysrTJ48GefOnUNBQQEKCgpw7tw5TJ48WWteBBER0SPLSqXcUQX79+/Hl19+iZYtW2qVjx8/Hps2bcLatWuRkpKC8+fPa62ALC8vR2hoKEpLS5GamooVK1YgPj4e06ZNk+qcPn0aoaGh6Ny5MzIyMjBu3Di8/vrr2LJli7xbVJUPptFosH37dnz//ffS9tXnz59HUVFRVZojIiIyLyZMIIqKihAeHo6lS5eiVq1aUvn169fx9ddf45NPPkGXLl3QunVrLF++HKmpqdizZw8AYOvWrcjOzsZ3332Hp556Cj179sTMmTPx2WefobS0FACwZMkS+Pj4YN68eWjevDlGjRqFF198EfPnz5d3i+R+sL///hv+/v7o27cvoqKicOnSJQDAnDlzMHHiRLnNERERWbSSkhIUFhZqHSUlJfetHxUVhdDQUAQHB2uVp6eno6ysTKu8WbNmaNiwIdLS0gAAaWlp8Pf3h4eHh1QnJCQEhYWFOHLkiFRHt+2QkBCpDWPJTiDGjh2LwMBAXLt2DQ4ODlL5Cy+8gKSkJLnNERERmR+VckdsbCxcXFy0jtjY2Hu+7Q8//IADBw7c83xeXh7s7Ozg6uqqVe7h4YG8vDypzt3JQ+X5ynP/VqewsBC3bt0y4ubcIXsVxq5du5Camgo7Ozut8kaNGuHcuXNymyMiIjI7QsGdKGNiYhAdHa1Vplar9er9888/GDt2LLZt2wZ7e3vF3r+6yO6BqKioQHl5uV752bNnUbNmTUWCIiIiMimVSrFDrVbD2dlZ67hXApGeno6LFy8iICAANjY2sLGxQUpKChYtWgQbGxt4eHigtLQUBQUFWtfl5+fD09MTAODp6am3KqPyZ0N1nJ2dtUYWDJGdQHTv3h0LFiyQflapVCgqKsL06dPRq1cvuc0RERERgK5duyIrKwsZGRnSERgYiPDwcOm1ra2t1nSBnJwc5ObmIigoCAAQFBSErKwsXLx4Uaqzbds2ODs7w8/PT6qjO+Vg27ZtUhvGkj2E8fHHH6NHjx7w8/PD7du3MXjwYBw/fhx16tTB999/L7c5IiIi82OCh2nVrFkTTz75pFaZo6Mj3NzcpPLIyEhER0ejdu3acHZ2xujRoxEUFIR27doBuPNHvp+fH4YMGYK5c+ciLy8PU6dORVRUlNTrMWLECCxevBiTJ0/GsGHDsGPHDqxZswaJiYmy4pWdQHh5eSEzMxMJCQnIzMxEUVERIiMjER4eLqvrg4iIyGyZ6UaU8+fPh5WVFcLCwlBSUoKQkBB8/vnn0nlra2ts3rwZI0eORFBQEBwdHREREYEPPvhAquPj44PExESMHz8eCxcuRIMGDbBs2TKEhITIikUlhBDGVi4rK0OzZs2wefNmNG/eXNYbGeLQcJCi7RFZgqPZg00dApFZ8nZ6vlrbb7goRbG2csd0UqwtcyKrB8LW1ha3b9+urliIiIjMgpUFPwRLKbJvUVRUFObMmaO3rzYREZGlUHARhsWSPQdi//79SEpKwtatW+Hv7w9HR0et8+vWrVMsOCIiIjJPshMIV1dXhIWFVUcsREREZsGSew6UIjuBWL58eXXEQUREZDZUzCAMMnoOREVFBebMmYNnnnkGTz/9NN5++21Ze2YTERE9KjgHwjCjE4hZs2bhnXfegZOTEx577DEsXLgQUVFR1RkbERERmSmjE4iVK1fi888/x5YtW7BhwwZs2rQJq1atQkVFRXXGR0RE9NCxB8IwoxOI3NxcrWddBAcHQ6VS4fz589USGBERkamorJQ7LJXRH02j0eg9XtTW1hZlZWWKB0VERETmzehVGEIIvPrqq1qPIL19+zZGjBihtRcE94EgIqJHnSUPPSjF6AQiIiJCr+yVV15RNBgiIiJzYIKHcT5yjE4guP8DERERVZK9kRQREZGl4xCGYUwgiIiIdDCBMMyCF5gQERFRdWEPBBERkQ4+C8MwJhBEREQ6LHkDKKUwgSAiItLBDgjDmGMRERGRbOyBICIi0sEeCMOYQBAREelgAmEYhzCIiIhINvZAEBER6eCzMAxjAkFERKSDQxiGcQiDiIiIZGMPBBERkQ72QBjGBIKIiEiHipMgDOIQBhEREcnGHggiIiIdHMIwjAkEERGRDiYQhjGBICIi0sEEwjDOgSAiIiLZ2ANBRESkg4swDGMCQUREpINDGIZxCIOIiIhkYw8EERGRDhX/vDaICQQREZEODmEYxhyLiIiIZGMPBBERkQ4VuyAMYgJBRESkg/mDYRzCICIiItnYA0FERKSDPRCGMYEgIiLSwQTCMLNJIG7lvm/qEIjMzq6846YOgcgseTtVb/vcytowzoEgIiIi2cymB4KIiMhcsAfCMCYQREREOqxUwtQhmD0OYRAREZFs7IEgIiLSwSEMw5hAEBER6WD3vGG8R0RERCQbeyCIiIh0cBKlYUwgiIiIdHAOhGEcwiAiIiLZ2ANBRESkg39dG8Z7REREpMNKpdwhxxdffIGWLVvC2dkZzs7OCAoKwq+//iqdv337NqKiouDm5gYnJyeEhYUhPz9fq43c3FyEhoaiRo0acHd3x6RJk6DRaLTqJCcnIyAgAGq1Gr6+voiPj5d/j2RfQUREZOFUKqHYIUeDBg0QFxeH9PR0/Pnnn+jSpQv69u2LI0eOAADGjx+PTZs2Ye3atUhJScH58+fRv39/6fry8nKEhoaitLQUqampWLFiBeLj4zFt2jSpzunTpxEaGorOnTsjIyMD48aNw+uvv44tW7bIu0dCCDOZanrM1AEQmR0+jZPo3p71DK3W9l/c8btibf3YpeMDXV+7dm189NFHePHFF1G3bl2sXr0aL774IgDg6NGjaN68OdLS0tCuXTv8+uuv6N27N86fPw8PDw8AwJIlSzBlyhRcunQJdnZ2mDJlChITE3H48GHpPQYOHIiCggL89ttvRsfFHggiIiIdSg5hlJSUoLCwUOsoKSkxGEN5eTl++OEHFBcXIygoCOnp6SgrK0NwcLBUp1mzZmjYsCHS0tIAAGlpafD395eSBwAICQlBYWGh1IuRlpam1UZlnco2jL5HsmoTERH9B1gpeMTGxsLFxUXriI2Nve97Z2VlwcnJCWq1GiNGjMD69evh5+eHvLw82NnZwdXVVau+h4cH8vLyAAB5eXlayUPl+cpz/1ansLAQt27dMvoecRUGERFRNYqJiUF0dLRWmVqtvm/9pk2bIiMjA9evX8ePP/6IiIgIpKSkVHeYsjGBICIi0qHkTpRqtfpfEwZddnZ28PX1BQC0bt0a+/fvx8KFC/Hyyy+jtLQUBQUFWr0Q+fn58PT0BAB4enpi3759Wu1VrtK4u47uyo38/Hw4OzvDwcHB6Dg5hEFERKTDVMs476WiogIlJSVo3bo1bG1tkZSUJJ3LyclBbm4ugoKCAABBQUHIysrCxYsXpTrbtm2Ds7Mz/Pz8pDp3t1FZp7INY7EHgoiIyEzExMSgZ8+eaNiwIW7cuIHVq1cjOTkZW7ZsgYuLCyIjIxEdHY3atWvD2dkZo0ePRlBQENq1awcA6N69O/z8/DBkyBDMnTsXeXl5mDp1KqKioqRekBEjRmDx4sWYPHkyhg0bhh07dmDNmjVITEyUFSsTCCIiIh2m6p6/ePEihg4digsXLsDFxQUtW7bEli1b0K1bNwDA/PnzYWVlhbCwMJSUlCAkJASff/65dL21tTU2b96MkSNHIigoCI6OjoiIiMAHH3wg1fHx8UFiYiLGjx+PhQsXokGDBli2bBlCQkJkxcp9IIjMGPeBILq36t4H4tXflZu0GN+xk2JtmRPOgSAiIiLZOIRBRESkQ8lVGJaKCQQREZEOJVZPWDomEERERDo4vm8Y7xERERHJxh4IIiIiHZwDYRgTCCIiIh2cA2GY7ARCo9HgyJEj0lO9PD094efnB1tbW8WDIyIiIvNkdAJRUVGBadOm4bPPPsP169e1zrm4uGDUqFF4//33YWXFaRVERPRoYw+EYUYnEG+//Tbi4+MRFxeHkJAQ6Vni+fn52Lp1K9577z2UlpZizpw51RYsERHRw8A/hQ0zOoFYuXIlvv32W729shs1aoThw4fD29sbQ4cOZQJBRET0H2B0AnHjxg3Ur1//vufr1auH4uJiRYIiIiIyJa7CMMzoXprnnnsOEydOxOXLl/XOXb58GVOmTMFzzz2nZGxEREQmYaVS7rBURvdALFmyBL169UK9evXg7++vNQciKysLfn5+2Lx5c7UFSkRERObD6ATCy8sLmZmZ2LJlC/bs2SMt42zTpg1mz56N7t27cwUGERFZBP42M0zWPhBWVlbo2bMnevbsWV3xEBERmZwlDz0ohTtREhER6VBxEqVBVeql8fHxQbdu3bTKgoOD0bhxY0WCIiIiIvNWpR6IiIgI1K1bV6vshRdeuOcKDSIiokcNhzAMq1ICMWPGDL2yqKioB42FiIjILHASpWFVvkelpaXIycmBRqNRMh4iIiJ6BMhOIG7evInIyEjUqFEDLVq0QG5uLgBg9OjRiIuLUzxAIiKih81KJRQ7LJXsBCImJgaZmZlITk6Gvb29VB4cHIyEhARFgyMiIjIF7kRpmOw5EBs2bEBCQgLatWsHler/7kyLFi1w8uRJRYMjIiIi8yQ7gbh06RLc3d31youLi7USCiIiokeVJfccKEX2EEZgYCASExOlnyuThmXLliEoKEi5yIiIiEzEWsHDUsnugZg9ezZ69uyJ7OxsaDQaLFy4ENnZ2UhNTUVKSkp1xEhERERmRnYPRIcOHZCRkQGNRgN/f39s3boV7u7uSEtLQ+vWrasjRiIiooeKqzAMq9JGUo8//jiWLl2qdCxERERmgXMgDJPdAxEcHIz4+HgUFhZWRzxEREQmx2WchslOIFq0aIGYmBh4enripZdews8//4yysrLqiI2IiIjMlOwEYuHChTh37hw2bNgAR0dHDB06FB4eHhg+fDgnURIRkUWwVil3WKoqPQvDysoK3bt3R3x8PPLz8/Hll19i37596NKli9LxERERPXQcwjCsSpMoK+Xl5eGHH37Ad999h0OHDqFNmzZKxUVERERmTHYCUVhYiJ9++gmrV69GcnIyGjdujPDwcCQkJODxxx+vjhiJiIgeKktefqkU2QmEh4cHatWqhZdffhmxsbEIDAysjriIiIhMxpKHHpQiO4HYuHEjunbtCiurKk2fICIiIgsgO4Ho1q1bdcRBRERkNiz5GRZKMSqBCAgIQFJSEmrVqoVWrVr961M3Dxw4oFhwREREpsAhDMOMSiD69u0LtVotveZju4mIiP7bVEIIM5lqeszUARCZnV15x00dApFZetYztFrb/+roFsXaGt4sRLG2zInsmZCNGzfGlStX9MoLCgrQuHFjRYIiIiIyJe5EaZjsSZRnzpxBeXm5XnlJSQnOnj2rSFBERESmxDkQhhmdQGzcuFF6vWXLFri4uEg/l5eXIykpCT4+PspGR0RERGbJ6ASiX79+AACVSoWIiAitc7a2tmjUqBHmzZunaHBERESmwB4Iw4xOICoqKgAAPj4+2L9/P+rUqVNtQREREZkSEwjDZM+BOH36dHXEQURERI+QKj2Ns7i4GCkpKcjNzUVpaanWuTFjxigSGBERkalY82FaBslOIA4ePIhevXrh5s2bKC4uRu3atXH58mXUqFED7u7uTCCIiOiRx6c9GSb7Ho0fPx7PP/88rl27BgcHB+zZswd///03WrdujY8//rg6YiQiIiIzIzuByMjIwIQJE2BlZQVra2uUlJTAy8sLc+fOxTvvvFMdMRIRET1UVirlDkslO4GwtbWVHuXt7u6O3NxcAICLiwv++ecfZaMjIiIyASYQhsmeA9GqVSvs378fTzzxBDp16oRp06bh8uXL+Pbbb/Hkk09WR4xERERkZmT3QMyePRv16tUDAMyaNQu1atXCyJEjcenSJXz11VeKB0hERPSwWauEYoelkt0DERgYKL12d3fHb7/9pmhAREREpmbJQw9KqdI+EERERJaMCYRhsocwWrVqhYCAAL2jdevWeOaZZxAREYGdO3dWR6xEREQWLTY2Fk8//TRq1qwJd3d39OvXDzk5OVp1bt++jaioKLi5ucHJyQlhYWHIz8/XqpObm4vQ0FBpj6ZJkyZBo9Fo1UlOTkZAQADUajV8fX0RHx8vK1bZCUSPHj1w6tQpODo6onPnzujcuTOcnJxw8uRJPP3007hw4QKCg4Px888/y22aiIjILJhqFUZKSgqioqKwZ88ebNu2DWVlZejevTuKi4ulOuPHj8emTZuwdu1apKSk4Pz58+jfv790vry8HKGhoSgtLUVqaipWrFiB+Ph4TJs2Tapz+vRphIaGonPnzsjIyMC4cePw+uuvY8uWLUbHqhJCyJrh8cYbb6Bhw4Z47733tMo//PBD/P3331i6dCmmT5+OxMRE/PnnnzJaPiYnDKL/hF15x00dApFZetYztFrb/+3sr4q11aNBzypfe+nSJbi7uyMlJQUdO3bE9evXUbduXaxevRovvvgiAODo0aNo3rw50tLS0K5dO/z666/o3bs3zp8/Dw8PDwDAkiVLMGXKFFy6dAl2dnaYMmUKEhMTcfjwYem9Bg4ciIKCAqPnNsrugVizZg0GDRqkVz5w4ECsWbMGADBo0CC9LhciIqL/opKSEhQWFmodJSUlRl17/fp1AEDt2rUBAOnp6SgrK0NwcLBUp1mzZmjYsCHS0tIAAGlpafD395eSBwAICQlBYWEhjhw5ItW5u43KOpVtGEN2AmFvb4/U1FS98tTUVNjb2wO48+jvytdERESPGiuVUOyIjY2Fi4uL1hEbG2swhoqKCowbNw7PPPOMtM9SXl4e7Ozs4OrqqlXXw8MDeXl5Up27k4fK85Xn/q1OYWEhbt26ZdQ9kr0KY/To0RgxYgTS09Px9NNPAwD279+PZcuWSVtZb9myBU899ZTcpomIiMyCkg/TiomJQXR0tFaZWq02eF1UVBQOHz6M3bt3KxiNcmQnEFOnToWPjw8WL16Mb7/9FgDQtGlTLF26FIMHDwYAjBgxAiNHjlQ2UiIiokeQWq02KmG426hRo7B582b8/vvvaNCggVTu6emJ0tJSFBQUaPVC5Ofnw9PTU6qzb98+rfYqV2ncXUd35UZ+fj6cnZ3h4OBgVIxVSrLCw8ORlpaGq1ev4urVq0hLS5OSBwBwcHDgEAYRET2yTLUKQwiBUaNGYf369dixYwd8fHy0zrdu3Rq2trZISkqSynJycpCbm4ugoCAAQFBQELKysnDx4kWpzrZt2+Ds7Aw/Pz+pzt1tVNapbMMYVdpIqqCgAD/++CNOnTqFiRMnonbt2jhw4AA8PDzw2GOPVaVJIiIis2Ftoo2koqKisHr1avz888+oWbOmNGfBxcUFDg4OcHFxQWRkJKKjo1G7dm04Oztj9OjRCAoKQrt27QAA3bt3h5+fH4YMGYK5c+ciLy8PU6dORVRUlNQTMmLECCxevBiTJ0/GsGHDsGPHDqxZswaJiYlGxyp7GeehQ4cQHBwMFxcXnDlzBjk5OWjcuDGmTp2K3NxcrFy5Uk5zd+EyTiJdXMZJdG/VvYwz5cIvirXVqV4vo+uqVPfOXJYvX45XX30VwJ2NpCZMmIDvv/8eJSUlCAkJweeffy4NTwDA33//jZEjRyI5ORmOjo6IiIhAXFwcbGz+r98gOTkZ48ePR3Z2Nho0aID33ntPeg+jYpWbQAQHByMgIABz585FzZo1kZmZicaNGyM1NRWDBw/GmTNn5DR3FyYQRLqYQBDdW3UnELvyjP9L3JDqjtVUZA9h7N+/H19++aVe+WOPPSZ1tRARET3K+CwMw2QnEGq1GoWFhXrlx44dQ926dRUJioiIyJSYQBgmexVGnz598MEHH6CsrAzAnfGa3NxcTJkyBWFhYUa1ce9duUrlhkJEREQmIjuBmDdvHoqKiuDu7o5bt26hU6dO8PX1Rc2aNTFr1iyj2rj3rlz6wyJERESmYKXgYalkT6KstHv3bhw6dAhFRUUICAjQ21P735SUlOjtA65W50KttqtKKEQWi5Moie6tuicm7ruk3CTKNnU5iVJLhw4d0KFDhypde+9duZg8EBERPSqMTiCM3d9h6NChVQ6GiIjIHHAOpWFGJxBjx4697zmVSoXi4mJoNBomEERE9Mi7z35OdBej53dcu3btnkd2djYGDBgAIQS6detWnbESERGRmajyBNEbN25g6tSpaNKkCTIyMrBlyxb89ttvSsZGRERkElyFYZjsSZRlZWX49NNPMXv2bLi5uWH58uV48cUXqyM2IiIik1CpqrRA8T/F6ARCCIGVK1di2rRp0Gg0mD17NiIjI2FtbV2d8REREZEZMjqBaNmyJU6dOoXRo0dj3LhxqFGjBoqLi/XqOTs7KxogERHRw8Y5lIYZvZGUldX/jeTc63GjQgioVCqUl5dXMRQ+jZNIFzeSIrq36t5IKvPqZsXa+l/t3oq1ZU6M7oHYuXNndcZBRERkNtgDYZjRCUSnTp2qMw4iIiJ6hFR5K2siIiJLxcd5G8YEgoiISAfzB8MseY8LIiIiqibsgSAiItLBZ2EYJrsHYtiwYbhx44ZeeXFxMYYNG6ZIUERERKakUvCwVLITiBUrVuDWrVt65bdu3TL6kd9ERET0aDN6CKOwsBBCCAghcOPGDdjb20vnysvL8csvv8Dd3b1agiQiInqYLLnnQClGJxCurq5QqVRQqVRo0qSJ3nmVSoX3339f0eCIiIhMgcs4DZO1E6UQAl26dMFPP/2E2rVrS+fs7Ozg7e2N+vXrV0uQREREZF5k70R5+vRpNGzY8J7PwyAiIrIE/A1nmOxJlDt27MCPP/6oV7527VqsWLFCkaCIiIhMSaUSih2WSnYCERsbizp16uiVu7u7Y/bs2YoERUREZEpcxmmY7AQiNzcXPj4+euXe3t7Izc1VJCgiIiIyb7ITCHd3dxw6dEivPDMzE25ubooERUREZEoqlXKHpZK9lfWgQYMwZswY1KxZEx07dgQApKSkYOzYsRg4cKDiARIRET1sfFCUYbITiJkzZ+LMmTPo2rUrbGzuXF5RUYGhQ4dyDgQREdF/hOwEws7ODgkJCZg5cyYyMzPh4OAAf39/eHt7V0d8RERED50lDz0opcpP42zSpMk9d6QkIiJ61DF/MMyoBCI6OhozZ86Eo6MjoqOj/7XuJ598okhgREREZL6MSiAOHjyIsrIy6fX9cHdKIiKyBPx1ZphRCcTOnTvv+ZqIiMgSMX8wjCtViIiISDajeiD69+9vdIPr1q2rcjBERETmgI/zNsyoBMLFxUV6LYTA+vXr4eLigsDAQABAeno6CgoKZCUaRERE5or5g2FGJRDLly+XXk+ZMgUDBgzAkiVLYG1tDQAoLy/HW2+9BWdn5+qJkoiI6CGy5KdoKkUlhJB1l+rWrYvdu3ejadOmWuU5OTlo3749rly5UsVQjlXxOiLLtSvvuKlDIDJLz3qGVmv7ebc2KtaWp0MfxdoyJ7InUWo0Ghw9elSv/OjRo6ioqFAkKCIiIlPi47wNk70T5WuvvYbIyEicPHkSbdq0AQDs3bsXcXFxeO211xQPkIiI6GHjPhCGyU4gPv74Y3h6emLevHm4cOECAKBevXqYNGkSJkyYoHiAREREZH5kz4G4W2FhIQAoNHmScyCIdHEOBNG9VfcciEu3lZsDUdeecyAkGo0G27dvx/fffy9tX33+/HkUFRUpGhwREZEpWCl4WCrZQxh///03evTogdzcXJSUlKBbt26oWbMm5syZg5KSEixZsqQ64iQiIiIzIjs5Gjt2LAIDA3Ht2jU4ODhI5S+88AKSkpIUDY6IiMgUVCrlDksluwdi165dSE1NhZ2dnVZ5o0aNcO7cOcUCIyIiMh0L/s2vENk9EBUVFSgvL9crP3v2LGrWrKlIUERERGTeZCcQ3bt3x4IFC6SfVSoVioqKMH36dPTq1UvJ2IiIiExCpeD/LFWV9oHo0aMH/Pz8cPv2bQwePBjHjx9HnTp18P3331dHjERERA+VSmXJ6yeUITuB8PLyQmZmJhISEpCZmYmioiJERkYiPDxca1IlERHRo8tyew6UIiuBKCsrQ7NmzbB582aEh4cjPDy8uuIiIiIiMyarj8bW1ha3b9+urliIiIjMgqnmQPz+++94/vnnUb9+fahUKmzYsEHrvBAC06ZNQ7169eDg4IDg4GAcP669Y+3Vq1cRHh4OZ2dnuLq6IjIyUm+jx0OHDuHZZ5+Fvb09vLy8MHfuXNn3SPYgT1RUFObMmQONRiP7zYiIiB4NpnkeZ3FxMf73v//hs88+u+f5uXPnYtGiRViyZAn27t0LR0dHhISEaP1xHx4ejiNHjmDbtm3YvHkzfv/9dwwfPlw6X1hYiO7du8Pb2xvp6en46KOPMGPGDHz11VeyYpX9LIzKDaOcnJzg7+8PR0dHrfPr1q2TFcD/4bMwiHTxWRhE91bdz8K4XrpFsbZc7EKqdJ1KpcL69evRr18/AHd6H+rXr48JEyZg4sSJAIDr16/Dw8MD8fHxGDhwIP766y/4+flh//79CAwMBAD89ttv6NWrF86ePYv69evjiy++wLvvvou8vDxpT6e3334bGzZswNGjR42OT3YPhKurK8LCwhASEoL69evDxcVF6yAiInrUqVRWih1KOX36NPLy8hAcHCyVubi4oG3btkhLSwMApKWlwdXVVUoeACA4OBhWVlbYu3evVKdjx45aG0KGhIQgJycH165dMzoe2aswli9fLvcSIiKiR4xyqzBKSkpQUlKiVaZWq6FWq2W1k5eXBwDw8PDQKvfw8JDO5eXlwd3dXeu8jY0NateurVXHx8dHr43Kc7Vq1TIqHqNTo4qKCsyZMwfPPPMMnn76abz99tu4deuWsZcTERH9J8XGxur11sfGxpo6rAdmdAIxa9YsvPPOO3BycsJjjz2GhQsXIioqqjpjIyIiMgklV2HExMTg+vXrWkdMTIzsmDw9PQEA+fn5WuX5+fnSOU9PT1y8eFHrvEajwdWrV7Xq3KuNu9/DGEYnECtXrsTnn3+OLVu2YMOGDdi0aRNWrVqFiooKo9+MiIjoUaBkAqFWq+Hs7Kx1yB2+AAAfHx94enpqPfm6sLAQe/fuRVBQEAAgKCgIBQUFSE9Pl+rs2LEDFRUVaNu2rVTn999/R1lZmVRn27ZtaNq0qdHDF4CMBCI3N1frWRfBwcFQqVQ4f/680W9GRERE91dUVISMjAxkZGQAuDNxMiMjA7m5uVCpVBg3bhw+/PBDbNy4EVlZWRg6dCjq168vrdRo3rw5evTogTfeeAP79u3DH3/8gVGjRmHgwIGoX78+AGDw4MGws7NDZGQkjhw5goSEBCxcuBDR0dGyYjV6EqVGo4G9vb1Wma2trVYGQ0REZBlM8yyMP//8E507d5Z+rvylHhERgfj4eEyePBnFxcUYPnw4CgoK0KFDB/z2229av59XrVqFUaNGoWvXrrCyskJYWBgWLVoknXdxccHWrVsRFRWF1q1bo06dOpg2bZrWXhHGMHofCCsrK/Ts2VOr22XTpk3o0qWL1l4Q3AeCSDncB4Lo3qp7H4hiTYpibTnadFKsLXNidA9ERESEXtkrr7yiaDBERETmgQ/TMsToBIL7PxAREVEl2RtJERERWTq5D8H6L2ICQUREpMc0kygfJbxDREREJBt7IIiIiHRwCMMwJhBEREQ6VComEIZwCIOIiIhkYw8EERGRHvZAGMIEgoiISIeKHfQG8Q4RERGRbOyBICIi0sMhDEOYQBAREengKgzDmEAQERHpYQJhCOdAEBERkWzsgSAiItLBVRiGMYEgIiLSwyEMQ5hiERERkWzsgSAiItLBh2kZxgSCiIhIB5dxGsYhDCIiIpKNPRBERER6+Pe1IUwgiIiIdHAOhGFMsYiIiEg29kAQERHpYQ+EIUwgiIiIdHAVhmFMIIiIiPRwhN8Q3iEiIiKSjT0QREREOrgKwzCVEEKYOggyHyUlJYiNjUVMTAzUarWpwyEyC/xeEOljAkFaCgsL4eLiguvXr8PZ2dnU4RCZBX4viPRxDgQRERHJxgSCiIiIZGMCQURERLIxgSAtarUa06dP50Qxorvwe0Gkj5MoiYiISDb2QBAREZFsTCCIiIhINiYQREREJBsTCAujUqmwYcMGk7x3cnIyVCoVCgoK/rVeo0aNsGDBgocSE/03mfJ7oCR+V8icMYGoorS0NFhbWyM0NFT2tab8R+HVV1+FSqWCSqWCnZ0dfH198cEHH0Cj0Txw2+3bt8eFCxfg4uICAIiPj4erq6tevf3792P48OEP/H7/5siRIwgLC0OjRo2gUqn4j3A1edS/B3FxcVrlGzZsMMljnE35XQGAtWvXolmzZrC3t4e/vz9++eWXan9PevQxgaiir7/+GqNHj8bvv/+O8+fPmzocWXr06IELFy7g+PHjmDBhAmbMmIGPPvrogdu1s7ODp6enwX+A69atixo1ajzw+/2bmzdvonHjxoiLi4Onp2e1vtd/2aP8PbC3t8ecOXNw7do1U4dyXw/ju5KamopBgwYhMjISBw8eRL9+/dCvXz8cPny4Wt+XLIAg2W7cuCGcnJzE0aNHxcsvvyxmzZqlV2fjxo0iMDBQqNVq4ebmJvr16yeEEKJTp04CgNYhhBDTp08X//vf/7TamD9/vvD29pZ+3rdvnwgODhZubm7C2dlZdOzYUaSnp2tdA0CsX7/+vrFHRESIvn37apV169ZNtGvXTgghxNWrV8WQIUOEq6urcHBwED169BDHjh2T6p45c0b07t1buLq6iho1agg/Pz+RmJgohBBi586dAoC4du2a9PruY/r06UIIIby9vcX8+fOFEEIMGjRIDBgwQCue0tJS4ebmJlasWCGEEKK8vFzMnj1bNGrUSNjb24uWLVuKtWvX3vcz6rr7/Ug5j/r3oHfv3qJZs2Zi0qRJUvn69euF7j+Lu3btEh06dBD29vaiQYMGYvTo0aKoqEg6f/78edGrVy9hb28vGjVqJFatWqX339y8efPEk08+KWrUqCEaNGggRo4cKW7cuCGEECb/rgwYMECEhoZqlbVt21a8+eab/3odEXsgqmDNmjVo1qwZmjZtildeeQXffPMNxF3baSQmJuKFF15Ar169cPDgQSQlJaFNmzYAgHXr1qFBgwb44IMPcOHCBVy4cMHo971x4wYiIiKwe/du7NmzB0888QR69eqFGzduPNDncXBwQGlpKYA7Xbt//vknNm7ciLS0NAgh0KtXL5SVlQEAoqKiUFJSgt9//x1ZWVmYM2cOnJyc9Nps3749FixYAGdnZ+lzTpw4Ua9eeHg4Nm3ahKKiIqlsy5YtuHnzJl544QUAQGxsLFauXIklS5bgyJEjGD9+PF555RWkpKRI1zRq1AgzZsx4oPtA8jzq3wNra2vMnj0bn376Kc6ePXvPOidPnkSPHj0QFhaGQ4cOISEhAbt378aoUaOkOkOHDsX58+eRnJyMn376CV999RUuXryo1Y6VlRUWLVqEI0eOYMWKFdixYwcmT54MwPTflbS0NAQHB2u9V0hICNLS0oy8k/SfZeIE5pHUvn17sWDBAiGEEGVlZaJOnTpi586d0vmgoCARHh5+3+vv9RexMX956SovLxc1a9YUmzZtksogoweioqJCbNu2TajVajFx4kRx7NgxAUD88ccfUv3Lly8LBwcHsWbNGiGEEP7+/mLGjBn3bPvuHgghhFi+fLlwcXHRq3f356+8fytXrpTODxo0SLz88stCCCFu374tatSoIVJTU7XaiIyMFIMGDZJ+7tKli/j000/vGRd7IKqHpXwP2rVrJ4YNGyaE0O+BiIyMFMOHD9e6dteuXcLKykrcunVL/PXXXwKA2L9/v3T++PHjAsC//je3du1a4ebmJv1syu+Kra2tWL16tdY1n332mXB3d79v/ERCCGFjwtzlkZSTk4N9+/Zh/fr1AAAbGxu8/PLL+Prrr/Hcc88BADIyMvDGG28o/t75+fmYOnUqkpOTcfHiRZSXl+PmzZvIzc2V1c7mzZvh5OSEsrIyVFRUYPDgwZgxYwaSkpJgY2ODtm3bSnXd3NzQtGlT/PXXXwCAMWPGYOTIkdi6dSuCg4MRFhaGli1bVvkz2djYYMCAAVi1ahWGDBmC4uJi/Pzzz/jhhx8AACdOnMDNmzfRrVs3retKS0vRqlUr6eekpKQqx0DyWcL3oNKcOXPQpUuXe/7Vn5mZiUOHDmHVqlVSmRACFRUVOH36NI4dOwYbGxsEBARI5319fVGrVi2tdrZv347Y2FgcPXoUhYWF0Gg0uH37Nm7evGn0HAd+V8jcMIGQ6euvv4ZGo0H9+vWlMiEE1Go1Fi9eDBcXFzg4OMhu18rKSqv7F4A0bFApIiICV65cwcKFC+Ht7Q21Wo2goCBp+MFYnTt3xhdffAE7OzvUr18fNjbG/2fw+uuvIyQkBImJidi6dStiY2Mxb948jB49WlYMdwsPD0enTp1w8eJFbNu2DQ4ODujRowcASN21iYmJeOyxx7Su43MJTMcSvgeVOnbsiJCQEMTExODVV1/VOldUVIQ333wTY8aM0buuYcOGOHbsmMH2z5w5g969e2PkyJGYNWsWateujd27dyMyMhKlpaWyJklWx3fF09MT+fn5WmX5+fmcfEwGcQ6EDBqNBitXrsS8efOQkZEhHZmZmahfvz6+//57AEDLli3/Ncu3s7NDeXm5VlndunWRl5en9Y9nRkaGVp0//vgDY8aMQa9evdCiRQuo1WpcvnxZ9udwdHSEr68vGjZsqJU8NG/eHBqNBnv37pXKrly5gpycHPj5+UllXl5eGDFiBNatW4cJEyZg6dKlRn/Oe2nfvj28vLyQkJCAVatW4aWXXoKtrS0AwM/PD2q1Grm5ufD19dU6vLy8ZH92enCW8j24W1xcHDZt2qQ37h8QEIDs7Gy9//Z8fX1hZ2eHpk2bQqPR4ODBg9I1J06c0FrZkZ6ejoqKCsybNw/t2rVDkyZN9FasmPK7EhQUpPf/07Zt2xAUFGQwHvqPM+HwySNn/fr1ws7OThQUFOidmzx5sggMDBRC3JkLYGVlJaZNmyays7PFoUOHRFxcnFS3W7duok+fPuLs2bPi0qVLQgghsrOzhUqlEnFxceLEiRNi8eLFolatWlpjv61atRLdunUT2dnZYs+ePeLZZ58VDg4OWmOtqMIqjLv17dtX+Pn5iV27domMjAzRo0cP4evrK0pLS4UQQowdO1b89ttv4tSpUyI9PV20bdtWmhmuOwfijz/+EADE9u3bxaVLl0RxcbEQ4t5j3++++67w8/MTNjY2YteuXXrn3NzcRHx8vDhx4oRIT08XixYtEvHx8VId3XHdkpIScfDgQXHw4EFRr149MXHiRHHw4EFx/Pjx+352Mo6lfg+GDBki7O3tteZAZGZmCgcHBxEVFSUOHjwojh07JjZs2CCioqKkOsHBwSIgIEDs3btXHDhwQHTu3Fk4ODhI80MyMjIEALFgwQJx8uRJsXLlSvHYY4+ZzXfljz/+EDY2NuLjjz8Wf/31l5g+fbqwtbUVWVlZ971/REIIwQRCht69e4tevXrd89zevXsFAJGZmSmEEOKnn34STz31lLCzsxN16tQR/fv3l+qmpaWJli1bCrVarfWP1RdffCG8vLyEo6OjGDp0qJg1a5bWP5wHDhwQgYGBwt7eXjzxxBNi7dq1ev/APGgCUbmM08XFRTg4OIiQkBCtZZyjRo0Sjz/+uFCr1aJu3bpiyJAh4vLly0II/QRCCCFGjBgh3Nzc7rs0rVJ2drYAILy9vUVFRYXWuYqKCrFgwQLRtGlTYWtrK+rWrStCQkJESkqKVMfb21tqXwghTp8+rbc0DoDo1KnTfT87GcdSvwenT58WdnZ2ess49+3bJ7p16yacnJyEo6OjaNmypdaS1fPnz4uePXsKtVotvL29xerVq4W7u7tYsmSJVOeTTz4R9erVk75TK1euNJvvihBCrFmzRjRp0kTY2dmJFi1aSEuzif4NH+dNRKSgs2fPwsvLC9u3b0fXrl1NHQ5RtWECQUT0AHbs2IGioiL4+/vjwoULmDx5Ms6dO4djx45J8xOILBFXYRARPYCysjK88847OHXqFGrWrIn27dtj1apVTB7I4rEHgoiIiGTjMk4iIiKSjQkEERERycYEgoiIiGRjAkFERESyMYEgIiIi2ZhAEBERkWxMIIiIiEg2JhBEREQkGxMIIiIiku3/AakZCL5evca/AAAAAElFTkSuQmCC",
      "text/plain": [
       "<Figure size 640x480 with 2 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# visualize confusion matrix with seaborn heatmap\n",
    "cm_matrix = pd.DataFrame(data=cm, columns=['Actual Positive:1', 'Actual Negative:0'], \n",
    "                                 index=['Predict Positive:1', 'Predict Negative:0'])\n",
    "\n",
    "sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **16. Classification metrices** <a class=\"anchor\" id=\"16\"></a>\n",
    "\n",
    "[Table of Contents](#0.1)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Classification Report\n",
    "\n",
    "\n",
    "**Classification report** is another way to evaluate the classification model performance. It displays the  **precision**, **recall**, **f1** and **support** scores for the model. I have described these terms in later.\n",
    "\n",
    "We can print a classification report as follows:-"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "       <=50K       0.93      0.81      0.86      7407\n",
      "        >50K       0.57      0.80      0.67      2362\n",
      "\n",
      "    accuracy                           0.81      9769\n",
      "   macro avg       0.75      0.81      0.77      9769\n",
      "weighted avg       0.84      0.81      0.82      9769\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from sklearn.metrics import classification_report\n",
    "\n",
    "print(classification_report(y_test, y_pred))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Classification accuracy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {
    "trusted": true
   },
   "outputs": [],
   "source": [
    "TP = cm[0,0]\n",
    "TN = cm[1,1]\n",
    "FP = cm[0,1]\n",
    "FN = cm[1,0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Classification accuracy : 0.8083\n"
     ]
    }
   ],
   "source": [
    "# print classification accuracy\n",
    "classification_accuracy = (TP + TN) / float(TP + TN + FP + FN)\n",
    "\n",
    "print('Classification accuracy : {0:0.4f}'.format(classification_accuracy))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Classification error"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Classification error : 0.1917\n"
     ]
    }
   ],
   "source": [
    "# print classification error\n",
    "classification_error = (FP + FN) / float(TP + TN + FP + FN)\n",
    "\n",
    "print('Classification error : {0:0.4f}'.format(classification_error))\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Precision\n",
    "\n",
    "\n",
    "**Precision** can be defined as the percentage of correctly predicted positive outcomes out of all the predicted positive outcomes. It can be given as the ratio of true positives (TP) to the sum of true and false positives (TP + FP). \n",
    "\n",
    "\n",
    "So, **Precision** identifies the proportion of correctly predicted positive outcome. It is more concerned with the positive class than the negative class.\n",
    "\n",
    "\n",
    "\n",
    "Mathematically, precision can be defined as the ratio of `TP to (TP + FP)`.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Precision : 0.8099\n"
     ]
    }
   ],
   "source": [
    "# print precision score\n",
    "precision = TP / float(TP + FP)\n",
    "\n",
    "print('Precision : {0:0.4f}'.format(precision))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Recall\n",
    "\n",
    "\n",
    "Recall can be defined as the percentage of correctly predicted positive outcomes out of all the actual positive outcomes.\n",
    "It can be given as the ratio of true positives (TP) to the sum of true positives and false negatives (TP + FN). **Recall** is also called **Sensitivity**.\n",
    "\n",
    "\n",
    "**Recall** identifies the proportion of correctly predicted actual positives.\n",
    "\n",
    "\n",
    "Mathematically, recall can be given as the ratio of `TP to (TP + FN)`.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Recall or Sensitivity : 0.9281\n"
     ]
    }
   ],
   "source": [
    "recall = TP / float(TP + FN)\n",
    "\n",
    "print('Recall or Sensitivity : {0:0.4f}'.format(recall))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### True Positive Rate\n",
    "\n",
    "\n",
    "**True Positive Rate** is synonymous with **Recall**.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "True Positive Rate : 0.9281\n"
     ]
    }
   ],
   "source": [
    "true_positive_rate = TP / float(TP + FN)\n",
    "\n",
    "\n",
    "print('True Positive Rate : {0:0.4f}'.format(true_positive_rate))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### False Positive Rate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "False Positive Rate : 0.4260\n"
     ]
    }
   ],
   "source": [
    "false_positive_rate = FP / float(FP + TN)\n",
    "\n",
    "\n",
    "print('False Positive Rate : {0:0.4f}'.format(false_positive_rate))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Specificity"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {
    "trusted": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Specificity : 0.5740\n"
     ]
    }
   ],
   "source": [
    "specificity = TN / (TN + FP)\n",
    "\n",
    "print('Specificity : {0:0.4f}'.format(specificity))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### f1-score\n",
    "\n",
    "\n",
    "**f1-score** is the weighted harmonic mean of precision and recall. The best possible **f1-score** would be 1.0 and the worst \n",
    "would be 0.0.  **f1-score** is the harmonic mean of precision and recall. So, **f1-score** is always lower than accuracy measures as they embed precision and recall into their computation. The weighted average of `f1-score` should be used to \n",
    "compare classifier models, not global accuracy.\n",
    "\n",
    "\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Support\n",
    "\n",
    "\n",
    "**Support** is the actual number of occurrences of the class in our dataset."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **17. Calculate class probabilities** <a class=\"anchor\" id=\"17\"></a>\n",
    "\n",
    "[Table of Contents](#0.1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "trusted": true
   },
   "outputs": [],
   "source": [
    "# print the first 10 predicted probabilities of two classes- 0 and 1\n",
    "y_pred_prob = gnb.predict_proba(X_test)[0:10]\n",
    "\n",
    "y_pred_prob"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Observations\n",
    "\n",
    "\n",
    "- In each row, the numbers sum to 1.\n",
    "\n",
    "\n",
    "- There are 2 columns which correspond to 2 classes - `<=50K` and `>50K`.\n",
    "\n",
    "    - Class 0 => <=50K - Class that a person makes less than equal to 50K.    \n",
    "    \n",
    "    - Class 1 => >50K  - Class that a person makes more than 50K. \n",
    "        \n",
    "    \n",
    "- Importance of predicted probabilities\n",
    "\n",
    "    - We can rank the observations by probability of whether a person makes less than or equal to 50K or more than 50K.\n",
    "\n",
    "\n",
    "- predict_proba process\n",
    "\n",
    "    - Predicts the probabilities    \n",
    "    \n",
    "    - Choose the class with the highest probability    \n",
    "    \n",
    "    \n",
    "- Classification threshold level\n",
    "\n",
    "    - There is a classification threshold level of 0.5.    \n",
    "    \n",
    "    - Class 0 => <=50K - probability of salary less than or equal to 50K is predicted if probability < 0.5.    \n",
    "    \n",
    "    - Class 1 => >50K - probability of salary more than 50K is predicted if probability > 0.5.    \n",
    "    \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "trusted": true
   },
   "outputs": [],
   "source": [
    "# store the probabilities in dataframe\n",
    "y_pred_prob_df = pd.DataFrame(data=y_pred_prob, columns=['Prob of - <=50K', 'Prob of - >50K'])\n",
    "\n",
    "y_pred_prob_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "trusted": true
   },
   "outputs": [],
   "source": [
    "# print the first 10 predicted probabilities for class 1 - Probability of >50K\n",
    "gnb.predict_proba(X_test)[0:10, 1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "trusted": true
   },
   "outputs": [],
   "source": [
    "# store the predicted probabilities for class 1 - Probability of >50K\n",
    "y_pred1 = gnb.predict_proba(X_test)[:, 1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "trusted": true
   },
   "outputs": [],
   "source": [
    "# plot histogram of predicted probabilities\n",
    "# adjust the font size \n",
    "plt.rcParams['font.size'] = 12\n",
    "\n",
    "# plot histogram with 10 bins\n",
    "plt.hist(y_pred1, bins = 10)\n",
    "\n",
    "# set the title of predicted probabilities\n",
    "plt.title('Histogram of predicted probabilities of salaries >50K')\n",
    "\n",
    "# set the x-axis limit\n",
    "plt.xlim(0,1)\n",
    "\n",
    "# set the title\n",
    "plt.xlabel('Predicted probabilities of salaries >50K')\n",
    "plt.ylabel('Frequency')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Observations\n",
    "\n",
    "\n",
    "- We can see that the above histogram is highly positive skewed.\n",
    "\n",
    "\n",
    "- The first column tell us that there are approximately 5700 observations with probability between 0.0 and 0.1 whose salary \n",
    "  is <=50K.\n",
    "\n",
    "\n",
    "- There are relatively small number of observations with probability > 0.5.\n",
    "\n",
    "\n",
    "- So, these small number of observations predict that the salaries will be >50K.\n",
    "\n",
    "\n",
    "- Majority of observations predcit that the salaries will be <=50K."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **18. ROC - AUC** <a class=\"anchor\" id=\"18\"></a>\n",
    "\n",
    "[Table of Contents](#0.1)\n",
    "\n",
    "\n",
    "\n",
    "### ROC Curve\n",
    "\n",
    "\n",
    "Another tool to measure the classification model performance visually is **ROC Curve**. ROC Curve stands for **Receiver Operating Characteristic Curve**. An **ROC Curve** is a plot which shows the performance of a classification model at various \n",
    "classification threshold levels. \n",
    "\n",
    "\n",
    "\n",
    "The **ROC Curve** plots the **True Positive Rate (TPR)** against the **False Positive Rate (FPR)** at various threshold levels.\n",
    "\n",
    "\n",
    "\n",
    "**True Positive Rate (TPR)** is also called **Recall**. It is defined as the ratio of `TP to (TP + FN)`.\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "**False Positive Rate (FPR)** is defined as the ratio of `FP to (FP + TN)`.\n",
    "\n",
    "\n",
    "\n",
    "In the ROC Curve, we will focus on the TPR (True Positive Rate) and FPR (False Positive Rate) of a single point. This will give us the general performance of the ROC curve which consists of the TPR and FPR at various threshold levels. So, an ROC Curve plots TPR vs FPR at different classification threshold levels. If we lower the threshold levels, it may result in more items being classified as positve. It will increase both True Positives (TP) and False Positives (FP).\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "trusted": true
   },
   "outputs": [],
   "source": [
    "# plot ROC Curve\n",
    "\n",
    "from sklearn.metrics import roc_curve\n",
    "\n",
    "fpr, tpr, thresholds = roc_curve(y_test, y_pred1, pos_label = '>50K')\n",
    "\n",
    "plt.figure(figsize=(6,4))\n",
    "\n",
    "plt.plot(fpr, tpr, linewidth=2)\n",
    "\n",
    "plt.plot([0,1], [0,1], 'k--' )\n",
    "\n",
    "plt.rcParams['font.size'] = 12\n",
    "\n",
    "plt.title('ROC curve for Gaussian Naive Bayes Classifier for Predicting Salaries')\n",
    "\n",
    "plt.xlabel('False Positive Rate (1 - Specificity)')\n",
    "\n",
    "plt.ylabel('True Positive Rate (Sensitivity)')\n",
    "\n",
    "plt.show()\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "ROC curve help us to choose a threshold level that balances sensitivity and specificity for a particular context."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ROC  AUC\n",
    "\n",
    "\n",
    "**ROC AUC** stands for **Receiver Operating Characteristic - Area Under Curve**. It is a technique to compare classifier performance. In this technique, we measure the `area under the curve (AUC)`. A perfect classifier will have a ROC AUC equal to 1, whereas a purely random classifier will have a ROC AUC equal to 0.5. \n",
    "\n",
    "\n",
    "So, **ROC AUC** is the percentage of the ROC plot that is underneath the curve."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "trusted": true
   },
   "outputs": [],
   "source": [
    "# compute ROC AUC\n",
    "\n",
    "from sklearn.metrics import roc_auc_score\n",
    "\n",
    "ROC_AUC = roc_auc_score(y_test, y_pred1)\n",
    "\n",
    "print('ROC AUC : {:.4f}'.format(ROC_AUC))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Interpretation\n",
    "\n",
    "\n",
    "- ROC AUC is a single number summary of classifier performance. The higher the value, the better the classifier.\n",
    "\n",
    "- ROC AUC of our model approaches towards 1. So, we can conclude that our classifier does a good job in predicting whether it will rain tomorrow or not."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "trusted": true
   },
   "outputs": [],
   "source": [
    "# calculate cross-validated ROC AUC \n",
    "\n",
    "from sklearn.model_selection import cross_val_score\n",
    "\n",
    "Cross_validated_ROC_AUC = cross_val_score(gnb, X_train, y_train, cv=5, scoring='roc_auc').mean()\n",
    "\n",
    "print('Cross validated ROC AUC : {:.4f}'.format(Cross_validated_ROC_AUC))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **19. k-Fold Cross Validation** <a class=\"anchor\" id=\"19\"></a>\n",
    "\n",
    "[Table of Contents](#0.1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "trusted": true
   },
   "outputs": [],
   "source": [
    "# Applying 10-Fold Cross Validation\n",
    "\n",
    "from sklearn.model_selection import cross_val_score\n",
    "\n",
    "scores = cross_val_score(gnb, X_train, y_train, cv = 10, scoring='accuracy')\n",
    "\n",
    "print('Cross-validation scores:{}'.format(scores))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can summarize the cross-validation accuracy by calculating its mean."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "trusted": true
   },
   "outputs": [],
   "source": [
    "# compute Average cross-validation score\n",
    "\n",
    "print('Average cross-validation score: {:.4f}'.format(scores.mean()))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Interpretation\n",
    "\n",
    "\n",
    "- Using the mean cross-validation, we can conclude that we expect the model to be around 80.63% accurate on average.\n",
    "\n",
    "- If we look at all the 10 scores produced by the 10-fold cross-validation, we can also conclude that there is a relatively small variance in the accuracy between folds, ranging from 81.35% accuracy to 79.64% accuracy. So, we can conclude that the model is independent of the particular folds used for training.\n",
    "\n",
    "- Our original model accuracy is 0.8083, but the mean cross-validation accuracy is 0.8063. So, the 10-fold cross-validation accuracy does not result in performance improvement for this model."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **20. Results and conclusion** <a class=\"anchor\" id=\"20\"></a>\n",
    "\n",
    "[Table of Contents](#0.1)\n",
    "\n",
    "\n",
    "1.\tIn this project, I build a Gaussian Naïve Bayes Classifier model to predict whether a person makes over 50K a year. The model yields a very good performance as indicated by the model accuracy which was found to be 0.8083.\n",
    "2.\tThe training-set accuracy score is 0.8067 while the test-set accuracy to be 0.8083. These two values are quite comparable. So, there is no sign of overfitting.\n",
    "3.\tI have compared the model accuracy score which is 0.8083 with null accuracy score which is 0.7582. So, we can conclude that our Gaussian Naïve Bayes classifier model is doing a very good job in predicting the class labels.\n",
    "4.\tROC AUC of our model approaches towards 1. So, we can conclude that our classifier does a very good job in predicting whether a person makes over 50K a year.\n",
    "5.\tUsing the mean cross-validation, we can conclude that we expect the model to be around 80.63% accurate on average.\n",
    "6.\tIf we look at all the 10 scores produced by the 10-fold cross-validation, we can also conclude that there is a relatively small variance in the accuracy between folds, ranging from 81.35% accuracy to 79.64% accuracy. So, we can conclude that the model is independent of the particular folds used for training.\n",
    "7.\tOur original model accuracy is 0.8083, but the mean cross-validation accuracy is 0.8063. So, the 10-fold cross-validation accuracy does not result in performance improvement for this model."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **21. References** <a class=\"anchor\" id=\"21\"></a>\n",
    "\n",
    "[Table of Contents](#0.1)\n",
    "\n",
    "\n",
    "\n",
    "The work done in this project is inspired from following books and websites:-\n",
    "\n",
    "1. Hands on Machine Learning with Scikit-Learn and Tensorflow by Aurélién Géron\n",
    "\n",
    "2. Introduction to Machine Learning with Python by Andreas C. Müller and Sarah Guido\n",
    "\n",
    "3. Udemy course – Machine Learning – A Z by Kirill Eremenko and Hadelin de Ponteves\n",
    "\n",
    "4. https://en.wikipedia.org/wiki/Naive_Bayes_classifier\n",
    "\n",
    "5. http://dataaspirant.com/2017/02/06/naive-bayes-classifier-machine-learning/\n",
    "\n",
    "6. https://www.datacamp.com/community/tutorials/naive-bayes-scikit-learn\n",
    "\n",
    "7. https://stackabuse.com/the-naive-bayes-algorithm-in-python-with-scikit-learn/\n",
    "\n",
    "8. https://jakevdp.github.io/PythonDataScienceHandbook/05.05-naive-bayes.html"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, now we will come to the end of this kernel.\n",
    "\n",
    "I hope you find this kernel useful and enjoyable.\n",
    "\n",
    "Your comments and feedback are most welcome.\n",
    "\n",
    "Thank you\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Go to Top](#0)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
